CPU over-subscription by joblib.Parallel due to BLAS

As mentioned in a previous post, joblib (among other things) is a nice tool to easily parallelize for-loops via joblib.Parallel. However, in combination with BLAS-based libraries (like numpy), this can lead to unexpectedly heavy CPU over-subscription, because simple operations like a dot product may spawn a multitude of threads. There is a bug report asking to fix or document this.

Behavior

Check out the following code example:
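(The snippet below is a minimal sketch; the function body and matrix size are my own choices and merely need to be large enough for BLAS to consider multithreading worthwhile.)

```python
import numpy as np
from joblib import Parallel, delayed


def func():
    # Large enough that the BLAS backend typically decides to multithread;
    # the exact threshold depends on the BLAS implementation.
    X = np.random.rand(1000, 600)
    for _ in range(100):
        X.dot(np.transpose(X))  # may internally spawn one thread per CPU


if __name__ == "__main__":
    # n_jobs=1: joblib itself runs the calls strictly sequentially.
    Parallel(n_jobs=1)(delayed(func)() for _ in range(5))
```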

Here, joblib.Parallel runs func() sequentially since we set n_jobs=1. Thus, our script should only use one CPU. Well, in some cases it does not. Specifically, this happens if an internal parallelization routine like BLAS kicks in. In that case, a simple dot product like X.dot(np.transpose(X)) may trigger a bunch of threads taking care of the matrix multiplication. By default, this will take up as many CPUs as there are. Thus, as soon as we ramp up the job count to n_jobs > 1, our CPUs get helplessly over-subscribed. That is, we end up with n_cpus * n_jobs threads. This degrades the performance of each job and may further hinder execution due to the overhead introduced by thread switching (cf. joblib's docs, Section 2.4.7).

Note that I wrote that threads are only spawned “in some cases”. That is, threading will only kick in if

  • numpy is built against some parallelization library like BLAS (there are others as well, like LAPACK or ATLAS), and
  • X is large enough that BLAS (or whatever underlying parallelization library numpy is built against) deems it worthwhile to spawn multiple threads. In my case, no threads were spawned when I set X = np.random.rand(100, 60).

Current “solution”

Unfortunately, this inherent parallelization (e.g., due to BLAS) can only be disabled by setting certain environment variables (see Section 2.4.7 of joblib's documentation for a list of relevant variables). This can, for example, be achieved directly in Python code:
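(A sketch covering the most common backends; which of the variables actually matters depends on how your numpy was built.)

```python
import os

# Tell the various thread pools numpy might use to stick to a single thread.
# Important: set these BEFORE importing numpy (or anything that loads BLAS).
os.environ["OMP_NUM_THREADS"] = "1"         # OpenMP (used e.g. by OpenBLAS)
os.environ["OPENBLAS_NUM_THREADS"] = "1"    # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"         # Intel MKL
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"  # Accelerate/vecLib on macOS
os.environ["NUMEXPR_NUM_THREADS"] = "1"     # numexpr

import numpy as np  # import only after the variables are set
```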

With the os.environ statements we are telling whatever parallelization library numpy is compiled against to only use a single thread for its operations.

As an alternative to setting the environment variables in your code, you can set them directly when calling your script.
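For example, roughly like this (my_script.py being a placeholder for your script):

```bash
OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 python3 my_script.py
```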

Notes and further resources

Joblib’s own documentation

Joblib actually documents the CPU over-subscription issue (see their docs, Section 2.4.7) and claims that for joblib.Parallel it turns off threading by certain third-party libraries, including for example numpy (at least for the loky parallelization backend). However, I checked this by forcing the loky backend, and in my case it still resulted in CPU over-subscription:
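(A sketch of the check; func() is the same kind of BLAS-heavy function as in the first example.)

```python
import numpy as np
from joblib import Parallel, delayed


def func():
    X = np.random.rand(1000, 600)
    for _ in range(100):
        X.dot(np.transpose(X))


if __name__ == "__main__":
    # Explicitly request the loky backend; per joblib's docs it should
    # limit third-party thread pools (e.g., numpy's BLAS) in its workers.
    Parallel(n_jobs=2, backend="loky")(delayed(func)() for _ in range(10))
```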

Links

There are also some more links I want to share which are relevant to the topic or were linked throughout this article:

And finally here is a link to a relevant bug report:

“Experimental setup”

For reference, the behavior described above was observed with Python 3, joblib 0.13.1, and Ubuntu 18.04 with kernel Linux 4.15.0-43-generic x86_64.

“Bagging based on resampling with and without replacement is equivalent”, is it?

Recently, I found the following StackOverflow post as well as the corresponding paper stating that bagging based on resampling with and without replacement is equivalent in the scenarios considered by the authors. In particular, the paper states that they found:

An equivalence between bagging based on resampling with and without replacement: resample sizes M_\text{with} = \alpha_\text{with}N and M_\text{w/o} = \alpha_\text{w/o}N (for resampling with and without replacement, respectively) produce very similar bagged statistics if \alpha_\text{with} = \alpha_\text{w/o}/(1 - \alpha_\text{w/o}). This approximate equality holds for each sample, not just on the average over samples. While our derivation is limited to U-statistics, the equivalence seems to hold more widely, as illustrated by our limited experiments with bagging CART trees.

in “Observations on Bagging” by Buja and Stuetzle (2006)

While the authors are optimistic that their finding may generalize, this needs further investigation, since their results are mainly limited to U-statistics. Also note that it is generally NOT possible to simply swap sampling with replacement for sampling without replacement in an arbitrary application scenario, as argued in the notes on random forests below. Nevertheless, I think it is an interesting result. The following examples and notes try to illustrate what the paper “Observations on Bagging” by Buja and Stuetzle (2006) actually implies and what it does not. Please feel free to comment and discuss 🙂

Illustrations

Here is some Python code illustrating the formula mentioned in the text:
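(A sketch using the sample mean as the bagged statistic; the data, the statistic, and the resample counts are my own choices.)

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)  # some sample to resample from
N = len(data)

alpha_wo = 0.5                          # resample fraction without replacement
alpha_with = alpha_wo / (1 - alpha_wo)  # equivalent fraction with replacement


def resampled_means(size, replace, n_resamples=2000):
    # The bagged statistic here is the sample mean of each resample.
    return np.array([rng.choice(data, size=size, replace=replace).mean()
                     for _ in range(n_resamples)])


means_with = resampled_means(int(alpha_with * N), replace=True)   # M_with
means_wo = resampled_means(int(alpha_wo * N), replace=False)      # M_w/o

# The spread of the resampled statistic should be very similar in both cases.
print("std with replacement:   ", means_with.std())
print("std without replacement:", means_wo.std())
```

With \alpha_\text{without} = 0.5, the formula yields \alpha_\text{with} = 1: to match the variance of half-sized resamples drawn without replacement, resampling with replacement needs the full sample size.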

It is apparent that, generally, replacement introduces variance, and that the formula from the paper correctly adjusts for that variance.

The following graph shows \alpha_\text{with} as a function of \alpha_\text{without}: the larger the fraction of the overall data we sample without replacement, the more samples with replacement are required to get the same result.
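It can be regenerated roughly as follows (assuming matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt

# alpha_with = alpha_without / (1 - alpha_without), per the paper's formula;
# it diverges as alpha_without approaches 1.
alpha_wo = np.linspace(0.0, 0.9, 200)
alpha_with = alpha_wo / (1.0 - alpha_wo)

plt.plot(alpha_wo, alpha_with)
plt.xlabel(r"$\alpha_\mathrm{without}$")
plt.ylabel(r"$\alpha_\mathrm{with}$")
plt.show()
```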

Note on random forests

For random forests, the concept of replacement is generally considered essential. This is because the underlying concept of random forests is bagging to prevent overfitting, i.e., bagging builds an ensemble of estimators trained on data with high variance (with regard to the training data each estimator has seen).

In particular, random forests sample with replacement as many samples as are present in the training data, that is, \alpha_\text{with}=1. According to the paper, this is equivalent to \alpha_\text{without}=\alpha_\text{with}/(1 + \alpha_\text{with})=1/2 in order to achieve the same variance as sampling with replacement. I haven't tried this, but it sounds plausible. However, reducing the number of samples to train on by half may get us into trouble if we only have a small number of samples to begin with, because then the individual estimators may not have enough data to learn anything.

Finally, if we trained random forests on data sampled without replacement with \alpha_\text{without}=1, i.e., on all of the data, this would mean that there is no variance in the data used to train the individual estimators of the ensemble. Thus, the ensemble would become meaningless. By the way, an \alpha_\text{with} matching \alpha_\text{without}=1 would approach infinity, which is hinted at in the plot above and can be seen from the corresponding formula.

Links

  • https://stats.stackexchange.com/questions/130662/what-is-the-effect-of-bootstap-resampling-in-bagging-algorithmensemble-learning
  • https://www.jstor.org/stable/24307547
  • https://www.quora.com/Why-does-random-forest-use-sampling-with-replacement-instead-of-without-replacement

Acknowledgements

Thanks to Florian Lemmerich for working on this with me 🙂

joblib.Parallel is overwriting variables

Joblib is a nice Python library for “lightweight pipelining”. This includes, for example, “easy simple parallel computing”. It is heavily used by scikit-learn to speed up, for example, machine learning algorithms. However, it has some quirks.

One of them is that joblib.Parallel (used for easily parallelizing for-loops) overwrites certain variables defined in the outer scope. These variables include, for example, args, parser, exitcode, and spawn (see pierreglaser's response here). This can be rather unexpected and cause confusion. There is a bug report asking to fix or to document this behavior. See the following code for an example:
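(A minimal reconstruction of such a situation; the exact original snippet may have differed.)

```python
import joblib

args = 5  # module-level variable with one of the affected names


def func():
    return args  # resolved from the (worker's) module namespace


if __name__ == "__main__":
    result = joblib.Parallel(n_jobs=2)(
        joblib.delayed(func)() for _ in range(2))
    print(result)
```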

In the previous code snippet, you would expect to get a list with two “5”s: [5, 5]. Instead, what you get is a list of two Namespace objects defined by the joblib.Parallel “context”.

For reference, this behavior was observed with Python 3, joblib 0.13.1, and Ubuntu 18.04 with kernel Linux 4.15.0-43-generic x86_64.

Watch out for duplicate column names in pandas DataFrames

Pandas is a powerful data analysis engine. While it is most of the time easy to use and behaves as a good panda should, sometimes you just need to be aware of some details and their implications to stay on its friendly side. For example, a DataFrame

  • a) allows columns with duplicate names, and
  • b) when retrieving columns by such a name, pandas gives you a DataFrame that contains all columns with that name for each column name you selected.

This happens silently and may quickly and unexpectedly increase the size of your DataFrame quite significantly!

Consider the following example, with a dataset containing two feature groups (with duplicate feature names) and one target (e.g., for learning a linear regression model):
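(Made-up data; the column names are chosen to match the description, with both groups accidentally containing a column named group1_a.)

```python
import pandas as pd

# Two feature groups that collide on the column name "group1_a",
# plus a target column.
group1 = pd.DataFrame({"group1_a": [1, 5], "group1_b": [2, 6]})
group2 = pd.DataFrame({"group1_a": [3, 7], "group2_b": [4, 8]})
target = pd.DataFrame({"target": [0, 1]})

df = pd.concat([group1, group2, target], axis=1)
print(list(df.columns))
# ['group1_a', 'group1_b', 'group1_a', 'group2_b', 'target']

# Select the group1 features by name.
features_from_group1 = [c for c in df.columns if c.startswith("group1")]
print(features_from_group1)  # ['group1_a', 'group1_b', 'group1_a']

features = df[features_from_group1]
```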

For this example, one would probably expect the feature DataFrame to contain only the features from group1. Thus, it would look like this:
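(Based on the made-up data above.)

```
   group1_a  group1_b
0         1         2
1         5         6
```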

However, due to the duplicate column names, our selected features (features_from_group1) contain the group1_a column name twice. Now, since pandas returns all matching columns for each given column name in features_from_group1, we get a DataFrame with “too many” columns:
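(Again based on the made-up data; each occurrence of group1_a in the selection list pulls in both matching columns.)

```
   group1_a  group1_a  group1_b  group1_a  group1_a
0         1         3         2         1         3
1         5         7         6         5         7
```

That is five columns instead of the expected two.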

This makes sense, in a way. However, when you are not aware of it, you can quickly increase the size of your DataFrame (or, in the example above, the feature set to learn from) by quite a large margin without noticing. Depending on the number of duplicate column names, this may blow up your memory, hinder your machine learning methods from learning efficiently (or from learning at all), or cause your HDD to fill up quickly when saving models 😉