
CPU over-subscription by joblib.Parallel due to BLAS

As mentioned in a previous post, joblib is (among other things) a nice tool to easily parallelize for-loops via joblib.Parallel. However, in combination with BLAS-based libraries (like numpy), this can lead to unexpectedly heavy CPU over-subscription, because simple operations like a dot product may spawn a multitude of threads. There is a bug report to solve or document this.

Behavior

Check out the following code example:
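A minimal sketch along these lines (the array size and repetition count are illustrative assumptions, not the original values):

    import numpy as np
    from joblib import Parallel, delayed

    def func():
        # a matrix large enough for BLAS to consider multi-threading
        X = np.random.rand(1000, 600)
        return X.dot(np.transpose(X))

    # n_jobs=1: one would expect a single busy CPU
    results = Parallel(n_jobs=1)(delayed(func)() for _ in range(10))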

Here, joblib.Parallel runs func() sequentially since we set n_jobs=1. Thus, our script should only use one CPU. Well, in some cases it does not, specifically when an internal parallelization routine like BLAS kicks in. In that case, a simple dot product like X.dot(np.transpose(X)) may trigger a bunch of threads that take care of the matrix multiplication. By default, this will take up as many CPUs as there are. Thus, as soon as we ramp up the job count to n_jobs > 1, our CPUs get helplessly over-subscribed. That is, we end up with n_cpus * n_jobs threads. This degrades the performance of each job and may further hinder execution due to the overhead introduced by thread switching (cf. joblib’s docs, Section 2.4.7).

Note that I wrote that threads are only spawned “in some cases”. That is, threading will only kick in if

  • numpy is built against a parallelizing library like BLAS (there are others as well, like LAPACK or ATLAS), and
  • X is large enough that BLAS (or whatever underlying parallelization library numpy is built against) deems it worthwhile to spawn multiple threads. In my case, if I set X = np.random.rand(100, 60), no threads are spawned.

Current “solution”

Unfortunately, this inherent parallelization (e.g., due to BLAS) can only be disabled by setting certain environment variables (see Section 2.4.7 of joblib’s documentation for a list of relevant variables). This can, for example, be achieved using Python code directly:
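A minimal sketch (the exact set of variables depends on the BLAS implementation numpy is built against; see the docs linked above):

    import os

    # limit the common BLAS/OpenMP thread pools to a single thread;
    # this must happen before numpy is imported
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["OPENBLAS_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"

    import numpy as np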

With the os.environ statements, we are telling whatever parallelization library numpy is compiled against to only use a single thread for its operations. Note that these variables have to be set before numpy is imported, i.e., before the underlying library is loaded and reads them.

As an alternative to setting the environment variables in your code, you can set them directly when calling your script. For example:
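Roughly like this (the script name is a placeholder):

    OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 python my_script.py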

Notes and further resources

Joblib’s own documentation

Joblib actually documents the CPU over-subscription issue (see their docs, Section 2.4.7) and claims that for joblib.Parallel it turns off threading by certain third-party libraries, including for example numpy (at least for the loky parallelization backend). However, I checked this by forcing the loky backend, and in my case it still resulted in CPU over-subscription:
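A sketch of this check (again, the array size is an illustrative assumption):

    import numpy as np
    from joblib import Parallel, delayed

    def func():
        X = np.random.rand(1000, 600)
        return X.dot(np.transpose(X))

    # explicitly force the loky backend; on my machine this still
    # over-subscribed the CPUs
    Parallel(n_jobs=2, backend="loky")(delayed(func)() for _ in range(10))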

Links

There are also some more links I want to share which are relevant to the topic or were linked throughout this article:

And finally, here is a link to a relevant bug report:

“Experimental setup”

For reference, the behavior described above was observed for Python3, joblib=0.13.1 and Ubuntu 18.04 with kernel Linux 4.15.0-43-generic x86_64.

“Bagging based on resampling with and without replacement is equivalent”, is it?

Recently, I found the following StackOverflow post as well as the corresponding paper, which states that bagging based on resampling with and without replacement is equivalent in the scenarios considered by the authors. In particular, the authors state that they found:

An equivalence between bagging based on resampling with and without replacement: resample sizes M_\text{with} = \alpha_\text{with}N and M_\text{w/o} = \alpha_\text{w/o}N (for resampling with and without replacement, respectively) produce very similar bagged statistics if \alpha_\text{with} = \alpha_\text{w/o}/(1 - \alpha_\text{w/o}). This approximate equality holds for each sample, not just on the average over samples. While our derivation is limited to U-statistics, the equivalence seems to hold more widely, as illustrated by our limited experiments with bagging CART trees.

in “Observations on Bagging” by Buja and Stuetzle (2006)

While the authors are optimistic that their finding may generalize, this needs further investigation, since their results are mainly limited to U-statistics. Also note that it is generally NOT possible to simply interchange sampling with and without replacement in an arbitrary application scenario, as argued in the notes on random forests below. Nevertheless, I think it is an interesting result. The following examples and notes try to illustrate what the paper “Observations on Bagging” by Buja and Stuetzle (2006) actually implies and what it does not. Please feel free to comment and discuss 🙂

Illustrations

Here is some Python code illustrating the formula mentioned in the text:
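A sketch of such an illustration, bagging the mean of a random sample (sample sizes and repetition counts are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=1000)
    n = len(data)

    alpha_wo = 0.5                          # fraction sampled without replacement
    alpha_with = alpha_wo / (1 - alpha_wo)  # equivalent fraction with replacement

    n_bags = 2000
    means_wo = [rng.choice(data, size=int(alpha_wo * n), replace=False).mean()
                for _ in range(n_bags)]
    means_with = [rng.choice(data, size=int(alpha_with * n), replace=True).mean()
                  for _ in range(n_bags)]

    # the spreads of the two bagged statistics should be very similar
    print(np.std(means_wo), np.std(means_with))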

It is apparent that, generally, replacement introduces additional variance, and that the formula from the paper correctly adjusts for that variance.

The following graph shows \alpha_\text{with} as a function of \alpha_\text{without}: the more of the overall data we sample without replacement, the more samples with replacement are required to achieve the same result.
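A quick sketch for reproducing such a plot:

    import numpy as np
    import matplotlib.pyplot as plt

    alpha_wo = np.linspace(0, 0.95, 100)
    alpha_with = alpha_wo / (1 - alpha_wo)

    plt.plot(alpha_wo, alpha_with)
    plt.xlabel(r"$\alpha_\mathrm{without}$")
    plt.ylabel(r"$\alpha_\mathrm{with}$")
    plt.show()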

Note on random forests

For random forests, the concept of replacement is generally considered essential. This is because the underlying concept of random forests is bagging to prevent overfitting; that is, bagging builds an ensemble of estimators whose training sets vary strongly (with regard to the samples each estimator has seen).

In particular, random forests sample with replacement as many samples as are present in the training data, that is, \alpha_\text{with}=1. According to the paper, this is equivalent to \alpha_\text{without}=\alpha_\text{with}/(1 + \alpha_\text{with})=1/2 in order to achieve the same variance as sampling with replacement. I haven’t tried this, but it sounds plausible. However, reducing the number of samples to train on by half may get us into trouble if we only have a small number of samples to begin with. This is because the individual estimators may not have enough data to learn anything.

Finally, if we trained random forests on the full data sampled without replacement, i.e., \alpha_\text{without}=1, this would mean that we have no variance in the data used to train the individual estimators of the ensemble; every estimator would see exactly the same data, and the ensemble would thus become meaningless. By the way, if we wanted an \alpha_\text{with} matching \alpha_\text{without}=1, \alpha_\text{with} would approach infinity, which is hinted at in the plot above and can be seen from the corresponding formula \alpha_\text{with} = \alpha_\text{without}/(1 - \alpha_\text{without}).

Links

  • https://stats.stackexchange.com/questions/130662/what-is-the-effect-of-bootstap-resampling-in-bagging-algorithmensemble-learning
  • https://www.jstor.org/stable/24307547
  • https://www.quora.com/Why-does-random-forest-use-sampling-with-replacement-instead-of-without-replacement

Acknowledgements

Thanks to Florian Lemmerich for working on this with me 🙂

joblib.Parallel is overwriting variables

Joblib is a nice Python library for “lightweight pipelining”. This includes, for example, “easy simple parallel computing”. It is heavily used by scikit-learn, for example to speed up machine learning algorithms. However, it has some quirks.

One of them is that joblib.Parallel (used for easily parallelizing for-loops) overwrites certain variables defined in the outer scope. These variables include, for example, args, parser, exitcode, or spawn (see pierreglaser’s response here). This can be rather unexpected and cause confusion. There is a bug report asking to fix or document this behavior. See the following code for an example:
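A minimal reconstruction of such a snippet (the value 5 follows the description below):

    from joblib import Parallel, delayed

    args = 5  # a module-level variable that happens to be called "args"

    def func():
        return args  # one would expect the outer-scope value 5

    if __name__ == "__main__":
        print(Parallel(n_jobs=2)(delayed(func)() for _ in range(2)))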

In the previous code snippet, you would expect to get a list with two 5s: [5, 5]. Instead, what you get is a list of two Namespace objects defined by the joblib.Parallel “context”.

For reference, this behavior was observed for Python3, joblib=0.13.1 and Ubuntu 18.04 with kernel Linux 4.15.0-43-generic x86_64.

Watch out for duplicate column names in pandas DataFrames

Pandas is a powerful data analysis engine. While it is easy to use most of the time and behaves as a good panda should, sometimes you just need to be aware of some details and their implications to stay on its friendly side. For example, a DataFrame

  • a) allows columns with duplicate names, and
  • b) when retrieving columns by such a name, pandas gives you a DataFrame that contains all columns with that name, and it does so for each column name you selected.

This happens silently and may unexpectedly increase the size of your DataFrame quite significantly!

Consider the following example, with a dataset containing two feature groups (with duplicate feature names) and one target (e.g., for learning a linear regression model):
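A sketch of such a dataset (column names and values are illustrative; note that group1_a appears twice):

    import pandas as pd

    df = pd.DataFrame(
        [[1, 2, 3, 4, 5, 6]],
        columns=["group1_a", "group1_b", "group1_a",
                 "group2_a", "group2_b", "target"],
    )

    # select all features belonging to group1 by prefix
    features_from_group1 = [c for c in df.columns if c.startswith("group1")]
    # -> ["group1_a", "group1_b", "group1_a"]

    features = df[features_from_group1]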

For this example, one would probably expect the feature DataFrame to contain only the features from group1. Thus, it would look like this:
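With the sketch above, that expectation would be:

       group1_a  group1_b  group1_a
    0         1         2         3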

However, due to the duplicate column names, our selected features (features_from_group1) contain the group1_a column name twice. Now, since pandas returns all matching columns for each given column name in features_from_group1, we get a DataFrame with “too many” columns:
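Again with the sketch above, the result looks roughly like this (five columns instead of three):

       group1_a  group1_a  group1_b  group1_a  group1_a
    0         1         3         2         1         3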

This makes sense, in a way. However, when you are not aware of it, you quickly increase the size of your DataFrame (or, in the example above, the feature set to learn from) by quite a large margin without noticing. Depending on the number of duplicate column names, this may blow up your memory, hinder your machine learning methods from learning efficiently (or from learning at all), or cause your HDD to fill up quickly when saving models 😉

SVGs are inline elements

So, I just found out that SVG elements are displayed inline by default. This has the “strange” effect that you get an overflow (which is not apparent, e.g., from the output of the Chrome debugging view) when you try to fill a parent element with an SVG. The following code shows how a scroll bar shows up for “no reason”:
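A minimal reconstruction (dimensions and colors are illustrative):

    <!-- the inline SVG sits on the text baseline, so a few pixels are
         reserved below it for descenders; the parent div overflows -->
    <div style="height: 200px; overflow: auto; border: 1px solid black">
      <svg style="width: 100%; height: 100%; background: lightgray"></svg>
    </div>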

To solve this issue, all you have to do is change the display property of the SVG, e.g., to block:
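Continuing the sketch from above:

    <div style="height: 200px; overflow: auto; border: 1px solid black">
      <svg style="width: 100%; height: 100%; display: block; background: lightgray"></svg>
    </div>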

Try it yourself here: https://jsfiddle.net/mgbckr/ebvso38v/

The reason for this is nicely explained in a StackOverflow post:

inline or inline-block elements are called “inline” because they are intended to be laid out amongst lines of text. So, wherever they appear, space is reserved for the “descent”, which is the area underneath a line of text where the dangly parts of lowercase g’s, j’s, and y’s go.


Directly accessing PDFs from BibSonomy in TeXlipse

I am using BibSonomy for managing my references and TeXlipse as my LaTeX editor. While the latter is arguably a little outdated, it still works well and provides some features which other editors seem to lack (e.g., partial compilation, which is not supported by TeXstudio or TeXmaker). Anyway, what I wanted was a way to directly access my PDFs from TeXlipse. It turned out there is an easy (if hacky) way of achieving this (of course there are other, cleaner ways, such as writing an extension or plugging into the existing BibSonomy extension). So here is what is possible now:

Highlight the Bib-Key (such as becker2016sparktrails) and hit F9 (or any custom shortcut). Voilà, the PDF opens up.

As I mentioned, the process to achieve this is a little hacky and has some caveats:

  1. It is based on the External Tools functionality built into Eclipse. That is, we take advantage of the shortcut (F9) for the last-used external tool, meaning that if we use another external tool, we need to manually run our “Open PDF” external tool once before the F9 shortcut works again.
  2. We need to manually trigger the process of downloading the PDFs from BibSonomy (which need to be stored there in the first place, obviously).

Nevertheless, I found the process quite useful, so here is how to do it:

Howto

We will store our PDFs in a folder within our LaTeX project; I use build/papers. Then we use an external tool which takes the currently selected Bib-Key and opens the corresponding PDF.

Downloading PDFs

For downloading the PDFs, I use a Python 3 script which I have placed into a folder in my LaTeX project. I call this folder build. Papers are downloaded to build/papers. The Python script is called as follows:
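Roughly like this (the script name and argument order are placeholders matching the sketch below):

    python build/download_bibsonomy_pdfs.py <username> <api-key> build/papers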

Here is the Python 3 source code. Note that this code currently only supports downloading up to 999 publications due to limitations of the BibSonomy API. This could be fixed by implementing a paginated downloading procedure.
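A rough sketch of such a script; the exact BibSonomy REST endpoints and JSON fields below are assumptions from memory, so double-check them against the API documentation before use:

    import os
    import sys

    import requests

    BASE = "https://www.bibsonomy.org/api"

    def download_pdfs(user, api_key, target_dir):
        os.makedirs(target_dir, exist_ok=True)
        auth = (user, api_key)
        # the API caps the requested range, hence the limit of 999
        # publications mentioned above
        resp = requests.get(
            f"{BASE}/users/{user}/posts",
            params={"resourcetype": "bibtex", "start": 0, "end": 999,
                    "format": "json"},
            auth=auth,
        )
        resp.raise_for_status()
        for post in resp.json()["posts"]["post"]:
            bibtex = post["bibtex"]
            bibkey = bibtex["bibtexKey"]
            intrahash = bibtex["intrahash"]
            # download every document attached to this publication
            for doc in bibtex.get("documents", {}).get("document", []):
                url = (f"{BASE}/users/{user}/posts/{intrahash}"
                       f"/documents/{doc['filename']}")
                pdf = requests.get(url, auth=auth)
                pdf.raise_for_status()
                with open(os.path.join(target_dir, bibkey + ".pdf"), "wb") as f:
                    f.write(pdf.content)

    if __name__ == "__main__":
        download_pdfs(sys.argv[1], sys.argv[2], sys.argv[3])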


Open PDFs

The configuration of the external tool (Run -> External Tools -> External Tools Configuration) is pretty easy. Here is my configuration using Okular as PDF viewer:
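In essence, the relevant fields boil down to something like the following (the Okular path and the use of Eclipse’s ${selected_text} variable are assumptions about my setup):

    Location:          /usr/bin/okular
    Working Directory: ${project_loc}
    Arguments:         ${project_loc}/build/papers/${selected_text}.pdf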

To configure a shortcut for this external tool, we use the key binding Run Last Launched External Tool, which I set to F9. Here is a screenshot of my config:

Future Work

Here is some future work which may prove interesting, but which I will probably never get around to 😉

  • Shortcuts for specific external tools (probably needs an extension) in order to replace the usage of the “last used external tool” key
  • An extension for automatically downloading missing PDFs since we currently need to start this process by hand
  • Extending the already existing BibSonomy plug-in for TeXlipse to create an integrated experience
  • Adding something like this to TeXstudio or TeXmaker via JavaScript extension points, in order to support more recent TeX-editors

Power-law, Pareto, Zipf and Scale-Free distributions

I did some related work on human mobility these days and came across the terms power-law, Pareto, Zipf, and scale-free distributions all the time. And, shame on me, I did not know the “difference”. Indeed, it turned out that all these notions are words for the same thing, as explained in

Power laws, Pareto distributions and Zipf’s law
M. Newman. Contemporary Physics 46(5): 323–351 (2005)

In particular, it says about power laws, Zipf’s law, and the Pareto distribution:

Since power-law cumulative distributions imply a power-law form for p(x), “Zipf’s law” and “Pareto distribution” are effectively synonymous with “power-law distribution”.

With regard to the scale-free aspect of power-law, it says:

A power-law distribution is also sometimes called a scale-free distribution. Why? Because a power law is the only distribution that is the same whatever scale we look at it on.

And, finally, a little fun fact:

Zipf’s law and the Pareto distribution differ from one another in the way the cumulative distribution is plotted—Zipf made his plots with x on the horizontal axis and P(x) on the vertical one; Pareto did it the other way around. This causes much confusion in the literature, but the data depicted in the plots are of course identical.

Google: transparent and in the academic realm


[…] we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine [Google] that is transparent and in the academic realm.

— Sergey Brin and Lawrence Page, 1998, The Anatomy of a Large-Scale Hypertextual Web Search Engine

Funny to think that this statement is from the founders of the company that makes most of its money through advertising 🙂

Fun Fact: Who’s Browsing?

I just came across an explanation of the origin of the phrase “browsing [the web]”, which I thought was interesting because usually one never really thinks about something like this. In their article “Online text retrieval via browsing”, Cove and Walsh (1988) explain its origin as follows:

The term “browsing” is usually applied to the actions of moving about a library and dipping into books, picking out bits and pieces of information of all kinds. The term is derived from the eating behavior of deer when selecting the fresh young shoots, and thus carries the connotation of selecting worthwhile and useful information.

A PhD thesis, LaTeX and me

For writing a PhD thesis in LaTeX, there are many package choices to make and a whole lot of little issues and pitfalls to handle. This starts with choosing the document class and … goes on and on. Here, I am collecting some points which took me more than a few moments, so I don’t have to go through the whole process again. And maybe my little list can help others make their choices a little faster or avoid some of the mistakes I made.

Choosing the document class

Even though there are alternatives, for me the main contenders for the document class were the KOMA-Script classes (scrbook or scrreprt) and memoir. In the end, I chose KOMA-Script as it seems to be less bloated (i.e., memoir emulates many well-established packages instead of using them directly), and there really seems to be no reason not to use it 😉 However, as you can see, I am not going into minute details, as the articles mentioned above were enough to sway me.

As for the subcategory, I use scrbook. Compared to the report or article classes (KOMA-Script or not), the book class offers an awful lot of features which fit a PhD thesis better (see this discussion). Just to name a few, such features include a two-page layout and more flexible numbering with regard to front, main, and back matter. Also, this article explicitly states that scrbook is for dissertations.
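For illustration, a minimal scrbook skeleton could look like this (the class options are examples, not a definitive setup):

    \documentclass[paper=a4, twoside, fontsize=11pt]{scrbook}

    \begin{document}

    \frontmatter  % roman page numbers: title page, abstract, TOC, ...
    \tableofcontents

    \mainmatter   % arabic page numbers: the actual chapters
    \chapter{Introduction}

    \backmatter   % unnumbered chapters: bibliography, ...

    \end{document}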

How to handle subfigures

An interesting choice was which package to use for subfigures. I chose the subcaption package. As to why: the subfigure package is old and superseded by subfig, while the newest, most flexible option seems to be a combination of the caption and floatrow packages. However, the subcaption package supposedly served as a case study for the interfaces provided by the latter two packages and provides easier-to-use functionality. Indeed, the description of the subfigure package on CTAN specifically states that it is “superseded […]” and that “users may find the more recent subcaption package more satisfactory”. Discussions on StackExchange further support this notion. Overall, I think it boils down to subfig supposedly having issues with hyperref, and subcaption being simpler than using a combination of caption and floatrow. Thus, in the end, subcaption seems to be the best choice.
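A minimal subcaption example (the included figure files are placeholders):

    \documentclass{scrbook}
    \usepackage{graphicx}
    \usepackage{caption}
    \usepackage{subcaption}

    \begin{document}

    \begin{figure}
      \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figure-a}
        \caption{First subfigure.}
      \end{subfigure}
      \hfill
      \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figure-b}
        \caption{Second subfigure.}
      \end{subfigure}
      \caption{A figure consisting of two subfigures.}
      \label{fig:example}
    \end{figure}

    \end{document}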