CPU over-subscription by joblib.Parallel due to BLAS

As mentioned in a previous post, joblib (among other things) is a nice tool to easily parallelize for-loops via joblib.Parallel. However, in combination with BLAS-backed libraries (like numpy) this can lead to unexpectedly heavy CPU over-subscription, because simple operations like a dot product may spawn a multitude of threads on their own. There is a bug report asking to solve or at least document this.

Behavior

Check out the following code example:
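
(The snippet below is only a minimal sketch of such a setup: the function name func, the array shape and the number of delayed calls are illustrative choices.)

    import numpy as np
    from joblib import Parallel, delayed

    # A reasonably large matrix so that the BLAS backend considers
    # multi-threading the dot product worthwhile.
    X = np.random.rand(1000, 600)

    def func(X):
        # A single dot product; no explicit parallelism on our side.
        return X.dot(np.transpose(X))

    # n_jobs=1: joblib executes the calls one after another.
    results = Parallel(n_jobs=1)(delayed(func)(X) for _ in range(10))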

Here, joblib.Parallel runs func() sequentially since we set n_jobs=1, so our script should only use one CPU. Well, in some cases it does not. Specifically, if an internal parallelization routine like BLAS kicks in, a simple dot product like X.dot(np.transpose(X)) may trigger a bunch of threads that take care of the matrix multiplication. By default, this will take up as many CPUs as there are. Thus, as soon as we ramp up the job count to n_jobs > 1, our CPUs get helplessly over-subscribed: we end up with n_cpus * n_jobs threads. This degrades the performance of each job and may further hinder execution due to the overhead introduced by thread switching (cf. joblib's docs, Section 2.4.7).

Note that I wrote that threads are only spawned "in some cases". That is, threading will only kick in if

  • numpy is built against some parallelization library like BLAS (there are others as well, like LAPACK or ATLAS); see the snippet after this list for a quick way to check which one your installation uses
  • X is large enough so that BLAS (or whatever underlying parallelization library numpy is built against) deems it worthwhile to spawn multiple threads. In my case, if I set X = np.random.rand(100,60), no additional threads are spawned.
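
You can check which BLAS/LAPACK implementation your numpy installation is linked against by printing its build configuration (the exact output format depends on your numpy version):

    import numpy as np

    # Prints numpy's build configuration, including the BLAS/LAPACK
    # libraries it is linked against.
    np.show_config()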

Current “solution”

Unfortunately, this inherent parallelization (e.g., due to BLAS) can only be disabled by setting certain environment variables (see Section 2.4.7 of joblib's documentation for a list of relevant variables). This can, for example, be achieved directly in Python code:
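
A minimal sketch of what that can look like; which variables are actually relevant depends on the library your numpy build links against (the three below cover the common OpenMP/OpenBLAS/MKL cases):

    import os

    # Set these before importing numpy, otherwise the underlying
    # library may already have initialized its thread pool.
    os.environ["OMP_NUM_THREADS"] = "1"        # generic OpenMP threading
    os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
    os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL

    import numpy as np
    from joblib import Parallel, delayed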

With the os.environ statements we are telling whatever parallelization library numpy is compiled against to only use a single thread for its operations.

As an alternative to setting the environment variables in your code, you can set them directly before calling your script. For example:
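
(Here, my_script.py is just a placeholder for your own script; again, pick the variables that match your numpy build.)

    OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 python my_script.py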

Notes and further resources

Joblib’s own documentation

Joblib actually documents the CPU over-subscription issue (see their docs, Section 2.4.7) and claims that for joblib.Parallel it limits the threading done by certain third-party libraries, including for example numpy (at least for the loky parallelization backend). However, when I checked this by forcing the loky backend, it still resulted in CPU over-subscription in my case:
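
Roughly like this (a sketch reusing func and X from the example above; n_jobs=4 and the number of calls are arbitrary):

    # Explicitly request the loky backend; in my case this still ended
    # up with roughly n_cpus threads per worker process.
    results = Parallel(n_jobs=4, backend="loky")(
        delayed(func)(X) for _ in range(10)
    )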

Links

There are also some more links I want to share which are relevant to the topic or were linked throughout this article:

And finally, here is a link to the relevant bug report:

“Experimental setup”

For reference, the behavior described above was observed with Python 3, joblib 0.13.1, and Ubuntu 18.04 (kernel Linux 4.15.0-43-generic, x86_64).
