Surviving in clusters

This post aims to share some practices that have made my life easier lately when running experiments on clusters. In a typical lab environment, one has access to shared computational resources, namely servers, on which experiments are run.

Connecting to a server

Connecting to those servers is the first thing to be done, and here ssh is your friend. A couple of useful tips that have saved me a lot of typing are listed below:

  • ssh: In case you ssh often to the machine, consider copying your public key to it so that you don't have to type your password every time you connect. Check ssh-copy-id. Simply put: ssh-copy-id your_username@machineName.xxxx.fr, where xxxx.fr stands for its address. If you have not generated a key pair beforehand, you must do that first; the process is really simple, check ssh-keygen.
  • Connecting to a remote machine B by first connecting to a machine A. In this case machine A is used as a bridge, and this tip has saved me a lot of typing. You can automate the process of connecting to machine B by adding the following lines to your ~/.ssh/config (create it if it does not exist; a fuller example follows the list):
    • Host nameOfMachineB
      ProxyCommand ssh your_username@machineAnameAddress nc %h %p
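
For completeness, a fuller config entry might look like the sketch below; the host names and the user name here are placeholders, so adapt them to your setup. With such an entry in place, a plain ssh nameOfMachineB first opens a connection through machine A and tunnels the traffic with nc:

Host nameOfMachineB
    HostName machineBnameAddress
    User your_username
    ProxyCommand ssh your_username@machineAnameAddress nc %h %p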

Controlling the CPU cores used

The second issue I have faced concerns controlling the number of cores my jobs use when parallelizing things. Using shared resources requires accounting not only for your own experiments but also for the other people who may use these resources, and this often boils down to accurately controlling the number of cores you occupy.

Numpy and OMP_NUM_THREADS

I mainly use Python and scikit-learn, so most of this is handled from inside the program, for example by setting parameters like "n_jobs" or by using process pools via the multiprocessing library.
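
For illustration (the function and the inputs below are just placeholders), a minimal sketch of the multiprocessing route, with an explicit bound on the number of worker processes, could look like this:

from multiprocessing import Pool

def process_item(x):
    # placeholder for the real per-item work
    return x * x

if __name__ == "__main__":
    # at most 4 worker processes, regardless of how many cores the machine has
    with Pool(processes=4) as pool:
        results = pool.map(process_item, range(100))
    print(len(results))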

In the case of the scikit-learn API, though, things sometimes work differently from what you would expect. The problem I have faced several times is the following:

Even after setting "n_jobs", monitoring the CPU usage showed the program using all the available cores.

Imagine setting "n_jobs=3" and seeing your program use 24 cores in bursts. After digging and stack-overflowing a lot, I found that this was due to numpy. Installing numpy from pre-built packages, without building it from source, gives you a BLAS back end (blas/atlas, openblas, etc.) that, when "OMP_NUM_THREADS" is left unset, defaults to using as many threads as the machine has cores. Then, when numpy performs inner products or other operations (via blas/atlas, etc.), it may choose to parallelize those calculations to accelerate them, using up to "OMP_NUM_THREADS" threads. How to solve this: either build numpy from source, where you have complete control over such details (as well as over the blas/atlas libraries), or export the "OMP_NUM_THREADS" variable prior to running your program. That is:

OMP_NUM_THREADS=4 python yourProgram.py
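
If you prefer to keep everything inside the script, setting the variable from Python should also work, as long as it happens before numpy is imported (otherwise the BLAS back end may already have picked its default). A minimal sketch:

import os
os.environ["OMP_NUM_THREADS"] = "4"  # must be set before numpy is imported

import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
c = a.dot(b)  # the matrix product should now use at most 4 threads,
              # assuming the BLAS back end respects OMP_NUM_THREADS
print(c.shape)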

CPU affinity and taskset

Apart from the above, there is another way I have discovered recently. Again the problem concerns controlling the maximum number of CPU cores a program uses, which is closely related to CPU affinity. I have found that the "taskset" command attaches a process to a particular core or, even better, to a set of cores. For instance, to run a job on cores 1, 2, 3 and 4 you could:

taskset -c 1-4 python yourProgram.py
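
On the Python side, and on Linux only, a similar effect can be obtained programmatically with os.sched_setaffinity, which pins the calling process to a set of cores much like taskset does from the command line. A small sketch:

import os

# pin the current process (pid 0 means "self") to cores 1-4;
# Linux-only, and it assumes the machine actually has cores 1-4
os.sched_setaffinity(0, {1, 2, 3, 4})
print("allowed cores:", os.sched_getaffinity(0))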

The advantages of taskset are the flexibility it offers, the complete control, and the fact that it is language independent: you can use it with Java, Python, and so on. If you want to learn more about taskset, I have found this page really useful: taskset

Grid-searching

Machine learning involves a lot of grid-searching and tuning. Often one must run the same script with different arguments or over different datasets in order to examine the behavior of a method under different settings. Lately, I have adopted an approach to this problem that consists of two steps:

  1. Writing the script so that the values that change often are taken as command-line arguments. For instance, for a topic modelling approach, the number of topics, the path to the training dataset and the path to the test dataset can be such arguments (a sketch of such a script is given right after this list).
  2. Running the script using “parallel”
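
As a sketch of step 1 (the argument names and the topic-modelling setting are only placeholders), such a script could expose its knobs as positional arguments, which also matches how parallel fills in {1}, {2} and so on:

import argparse

def main():
    parser = argparse.ArgumentParser(description="grid-search-friendly toy script")
    parser.add_argument("n_topics", type=int, help="number of topics")
    parser.add_argument("train_path", help="path to the training dataset")
    parser.add_argument("test_path", help="path to the test dataset")
    args = parser.parse_args()
    # placeholder for the actual training and evaluation code
    print("training with %d topics on %s, evaluating on %s"
          % (args.n_topics, args.train_path, args.test_path))

if __name__ == "__main__":
    main()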

"Parallel", which is written in Perl, offers a lot of control in such scenarios. I provide a couple of examples below to demonstrate its advantages:

parallel --jobs 4 python yourScript.py {1} {2} ::: arg1_value1 arg1_value2 … arg1_valueN ::: arg2_value1 arg2_value2 … arg2_valueN

will run python yourScript.py with all possible combinations of the given arguments, running up to 4 instances in parallel. {1} and {2} are placeholders for the arguments, which are given as lists after the ':::' separators. You can combine it with nohup and a simple logging scheme:

nohup parallel --jobs 4 python yourScript.py {1} {2} '>' log.{1}.{2} ::: arg1_value1 arg1_value2 … arg1_valueN ::: arg2_value1 arg2_value2 … arg2_valueN

and now you have a nice way of running scripts with several argument combinations, controlling how many instances run in parallel, and logging the outputs so that you can inspect the results later.

That's it! Let me know in the comments if you liked the post. Also, don't hesitate to share what you are using in such scenarios, or how these tips can be further improved!


ECIR 2016: Multilingual text classification

Our work "Multi-label, Multi-class Classification Using Polylingual Embeddings" has been accepted at the European Conference on Information Retrieval. In case you missed the paper, here is a short summary of the work. It was presented as a poster, which you can have a look at on SlideShare.

The main question that motivates the paper is whether parallel translations of a document can be used to create richer representations and whether, for a given task, those new representations perform better than the monolingual ones. I call these "new" representations polylingual, simply because they are generated by combining information from more than one language. For convenience, assume that we operate at the word level. Departing from the space of each language (the language-dependent space), e.g., English and French, we generate a new space. Whereas in the language-dependent spaces each word is mapped to a point, in the induced polylingual space each pair of words (a word and its translation) is mapped to a point. This is also what the main poster figure illustrates. The intuition is that, by combining languages, one can create richer semantic representations that achieve word disambiguation, etc.

How to do that? Since at that time I was into distributed representations and word2vec, I decided to follow this path:

  1. Generate word2vec vectors for each language
  2. Apply the average composition function to generate document representations from word representations. This means that, given the words of a document, we average their representations to obtain the document representation (a minimal sketch is given right after this list).
  3. Having the document representations in, say, English and French, obtain the polylingual document representation using a denoising autoencoder, which learns a compressed representation of its inputs. I compressed the representations by a factor of 0.7, which was tuned experimentally.
  4. Compare the performance on document classification with SVMs using tf-idf representations, monolingual distributed representations and polylingual distributed representations.
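
To make step 2 concrete, here is a minimal sketch of the averaging composition; the tiny vocabulary and the 3-dimensional vectors are placeholders, and in practice the vectors would come from a trained word2vec model, with out-of-vocabulary words simply skipped:

import numpy as np

def average_composition(words, word_vectors, dim):
    # average the vectors of the in-vocabulary words of a document
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# toy usage
word_vectors = {"neural": np.array([0.1, 0.2, 0.3]),
                "network": np.array([0.0, 0.5, 0.1])}
document = "a neural network model".split()
print(average_composition(document, word_vectors, dim=3))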


In the experiments I found that, given only a small amount of labeled data, polylingual representations perform best. As more labeled data become available, though, standard tf-idf representations become competitive and outperform the polylingual ones.


Discussion: I have assumed access to parallel translations of the texts to be classified. This is not quite realistic, so I used Google Translate, which is considered a state-of-the-art system, to generate those translations. The effect of the translation quality has to be investigated further. Also, note that a simple composition function (averaging) has been used to obtain document representations. I plan to try better composition functions that either rely on more operations than simple averaging, such as min and max, or use neural networks. Among these, I have tried paragraph vectors (the results are included in the paper), but they were not as competitive as word2vec combined with composition functions. Finally, I used English and French as the language pair, which raises the question of how pairs like English and Chinese would perform. This is to be investigated in the future.

Conclusions: this work provides evidence that, as machine translation systems improve, one can improve the performance on a task through such fusion mechanisms. This is related to multi-view representation learning. The paper also builds on distributed representations, a concept that is quite exciting given the observation that such representations can capture semantic and linguistic similarities. I strongly believe that representation learning is a very promising direction to work on.

I would like to thank my thesis supervisor and co-author of the paper, Massih-Reza Amini.

IDA 2015: Efficient Model Selection

This is a short discussion of our full paper "Efficient Model Selection for Regularized Classification", which was presented at the Intelligent Data Analysis Symposium 2015. The main idea of the work is a model selection technique, an alternative to k-fold cross validation and hold-out, that performs on par with cross validation but is k times faster. The paper and the conference presentation are available.

The method uses unlabeled data to perform model selection. Unlike cross validation, it can leverage such data, which in real scenarios are abundant, whereas obtaining labeled data is expensive. Actually, the gain in complexity comes exactly from the availability of such data. In the paper, we propose a theorem that motivates a model selection algorithm. The theorem uses a quantification step, so I will first discuss quantification (also presented at ESSIR 2015).

Quantification is the task of predicting the prevalence of a class in unseen data, where prevalence stands for the marginal probability of the class. For instance, in a binary classification problem with 10 documents in class A and 10 documents in class B, the prevalence of class A is 0.5 and the prevalence of class B is 0.5. This is different from classification, although the simplest approach to quantification is to classify the instances and then count them. To illustrate the difference, imagine the following extreme situation: the worst possible classifier in the above problem assigns the instances of class A to class B and those of class B to class A. This is a very bad classifier (accuracy = 0!) but a perfect quantifier.
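
To illustrate the classify-and-count idea on exactly this toy example, and how classification and quantification error can disagree, here is a small sketch:

from collections import Counter

true_labels = ["A"] * 10 + ["B"] * 10
predicted   = ["B"] * 10 + ["A"] * 10  # the "worst possible" classifier from the example

accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
prevalence = {c: n / len(predicted) for c, n in Counter(predicted).items()}

print("accuracy:", accuracy)                # 0.0
print("estimated prevalence:", prevalence)  # 0.5 for each class: a perfect quantification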

Since the paper is accessible here, I will provide neither mathematical formulas nor experimental results in this post; I will restrict the presentation to the underlying intuition. To do so, I will use as an example a dataset whose categories follow a power-law distribution. This means that most of the classes are represented by only a few instances, while a few classes are very well represented and actually dominate the data. Visualized in a log-log plot, the class-size distribution approaches a straight line.

Here is the intuition behind the paper's theorem and how it motivates a model selection algorithm. In a power-law classification setting, it is common for a classifier to assign the false positives to the high-prevalence classes. Hence, the induced prevalence of the well-represented classes in the unlabeled data increases and, for the same reason, the prevalence of the less frequent classes on the same data decreases. Given, however, that the i.i.d. assumption holds between the training and the unlabeled data, the hyper-parameter value of a classifier that results in the best quantification performance also maximizes the bound of the paper. Consequently, this motivates the choice of a classifier's hyper-parameter and the model selection step. More details are available in the paper, along with a solid experimental part that demonstrates the merits of the technique.

Limitations and future work: for this approach, we assume the supervised classification setting where the i.i.d. assumption holds between the training, the unlabeled and the test data. Modifying the proposed bound to cope with non-i.i.d. settings is something to be researched further. Also, similar methods usually try to maximize lower bounds instead of upper bounds. I am now trying to see how we can combine the upper bound with a lower one, as well as the implications of doing so.

Special thanks and credits to the paper co-authors: Ioannis Partalas, Eric Gaussier, Rohit Babbar and Massih-Reza Amini.


ESSIR 2015: some late thoughts

I recently attended the European Summer School on Information Retrieval (ESSIR 2015), which was organised in Thessaloniki, Greece, from August 31st until September 4th. It was my first such event and I was quite enthusiastic when I registered. The daily program was split between lectures (three two-hour lectures, from 9.00 until 17.00) and social events. There were 52 participants, mostly from Europe, and several lecturers. The summer school was organised by CERTH; although I did my undergraduate studies in Thessaloniki, I had never visited CERTH, and I was impressed by the facilities and the place in general.

I spent my first PhD year studying Machine Learning (ML), and when registering I thought that attending an IR school would be beneficial: several IR papers use ML techniques as black boxes, and interacting with this community could trigger interesting discussions. Also, some lectures, such as "Multilingual Summarization", "Machine Learning for IR" and "Opinion and Sentiment Analysis", are quite interesting and relevant to what I have been doing. As a result, there were two dimensions: the academic and the social.

I was exposed to several aspects of IR. Search, of course, is the major application, and different topics around it were discussed. I was fascinated(!) by the fact that there are more than 100 evaluation measures used in the field! What is more, the community cannot easily reach an agreement on which are the most suitable for each task. I guess that an IR scientist who respects him/herself should propose his/her own measure.

Below are some notes from the lectures I found most interesting:

– Quantification
I had the opportunity to discuss with F. Sebastiani and attend his presentation. Part of his research concerns quantification. Quantification, compared to classification, concerns how many instances are assigned to a class and not where each instance belongs. The idea is that when a company like Apple wishes to learn the opinions about the new iPhone, they are interested in the general picture and not in what each individual user believes. For example, given a dataset of tweets about the new iPhone, the quantification problem is expressed as: how many tweets are positive (and not which of the tweets are positive). Sebastiani also introduced Task 4 of SemEval 2016, which addresses the above-mentioned problem. I found the discussion interesting and the talk easy to follow.

– Statistical significance tests
I attended Evangelos Kanoulas' lecture on how one can perform a statistical significance test in order to show that the obtained results are significantly different from the rest (baseline or state-of-the-art systems). Evangelos relaxed some of the test hypotheses by introducing distributions instead of fixed/assumed parameter values. By adding more degrees of freedom, he demonstrated that experiments that were shown to be statistically significant when using standard values were not when the values were replaced by Gaussian distributions. Also, after thoroughly explaining how the tests work, he sketched how one can design experiments to obtain statistical significance.

– Distributed representations (DRs)
The hype around embeddings was evident both in the summer school and in several of the SIGIR 2015 papers, as pointed out by several lecturers. One of the directions that I found interesting was using DRs to represent queries and documents before the retrieval stage. There is also work on kernels that are optimized for retrieval, given that the features are encoded using neural networks. From discussions with students and lecturers during the coffee breaks, I got the impression that the community is trying to come up with a robust alternative to bag-of-words. I don't think that this quest is something new; however, the fact that DRs trained with neural networks yielded promising results in several tasks justifies the hype.

– Graph-of-words (GoW)
The talk of M. Vazirgiannis was devoted to GoW, a way to represent text using graphs. I found this idea promising (I don't know how novel it is) because it has some nice properties. The intuition behind it, in my opinion, is that it generalizes n-grams and skip-grams. They have published several papers in top conferences during the last 2-3 years demonstrating the use of GoW in several tasks (classification and summarization included).

Along with several other talks, I also attended the Symposium on Future Directions in Information Access (FDIA). I talked with the authors of the posters and was impressed by the huge variety of work presented. Retrieval and its challenges in different languages, eco-friendly search where the search engine is connected with the hardware to save energy, and sentiment analysis using fractals are only some of them.

The social dimension and the activities are something that cannot be described in a blog post. I believe I met the majority of the participants and discussed with them different ideas, problems and everyday life. Being there and having people who can understand (and even find interesting) your research was great! Being out with them at night to eat and drink was even better! The fact that we used to go to CERTH by bus, because it is situated away from the center of Thessaloniki, made me feel like a teenager in school; I already miss it. To be honest, I am convinced that it was the school's participants and their attitude that made my participation in ESSIR so awesome and memorable. We will certainly meet at the upcoming conferences! I am looking forward to ECIR 2016.

I decided not to describe extensively either the week or the lectures to keep the post short. For anybody who is interested, most of the slides and probably videos will become available here. Have a great week!

Evaluating research.. an idealist’s view

Doing research in the academic environment is like pursuing excellence. This is the outcome of a recent discussion with some of my colleagues. We were arguing about conferences, journal publications and evaluation in general. In the rest of this post, I will try to explain myself.

At this point of my life I am at the beginning of my PhD. From what I have seen, obtaining such a degree requires publications. Publications serve a specific goal: they communicate problems, ideas and solutions so that others become aware of them, use them and improve on them. I believe it is a slow process; however, I guess it is effective. It has several inherent problems. A main one is the replicability of the solutions by others, but that is not the topic of this post; the community seems to be aware of it and actively discusses it (e.g., the replicability session at SIGIR) or encourages making code and datasets public.

Back to publications now. Publishing follows a loop: (i) you identify a problem to solve, (ii) you study the relevant literature and struggle to come up with ideas or extensions to already proposed methods, (iii) you implement your ideas and test them in some realistic scenarios, and (iv) assuming you obtain interesting results, you try to publish them for the whole community to access. Within a particular paper, as time passes you optimize each of the above steps, given the experience gained from the others. In this process, having ideas, being creative and hard-working, and having a solid mathematical and technical (coding) background help; they can save you a lot of energy and time.

So how and why is excellence relevant? In my opinion, the concept of excellence is in there from the beginning. However, one can most easily identify it in the evaluation that precedes the publication step. Publishing in a top conference or in a journal with a high impact factor is not simple. It requires an interesting problem, a thorough study of the ideas and the work of other scientists in the field (you essentially build on others' work), a good idea with a solid mathematical construction, and a strong experimental part. It also requires the ability to communicate your thoughts clearly and to convince with your writing.

This bunch of requirements is not easy to deal with. Other scientists (reviewers) approach your work critically, trying to identify flaws. Each paper is assigned to several reviewers to maximize the probability that those flaws will be identified and to make sure that the paper meets the quality standards. Creating such high-quality work takes time and effort; normally a lot of people contribute with their ideas to the final result, and it is submitted only when the authors feel that it fulfills the publication requirements. Even then, the authors use their experience to assess its quality and submit it to the venue they believe is the most appropriate. As a result, the objective is for A* conferences and high-impact journals to end up publishing significant contributions with the potential to generalize and be adopted widely; smaller conferences adopt the same criteria but for more targeted or focused contributions.

Excellence, as the quality of being extremely good, is reached through the repetition of this process. Mediocre work usually gets rejected, and the authors re-submit it after strengthening the weak parts identified by the reviewers, until the standards are met. This reveals the value of persistence in research: having a good idea and insisting on it is expected to pay off.

This is my idealistic understanding of what research evaluation should stand for. Since humans are involved, the results are sometimes (many times?) frustrating or even not justified. With this in mind, I believe that one can profit a lot from the process by providing well-reasoned and civil arguments in rebuttal phases and by keeping a positive attitude. After all, excellence is an art to be won by training!

Hello world!

This will be the first post on my blog and I am really happy about it. Mostly because I have had this idea for a while, but it is only now that I am trying to make it happen. As I think of it now, I will be publishing here things I find interesting: science, software, hacks... or even more general thoughts on travelling, life, education...

That’s it! Do not hesitate to contact me, ask questions or start a discussion. After all the goal is to share, to learn and to have fun.