libtorch_cpu.so replaces some of R's built-in LAPACK routines
I have been working on a problem that I am having a very hard time reproducing consistently. If I run our code just as it is, 8 out of 12 workers spawned by mclapply simply stop responding, sitting at 0% CPU. To debug this I started adding cat statements all over the code and eventually narrowed the hang down to the R function chol2inv.
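For context, the call pattern looks roughly like the sketch below. This is illustrative only, not the actual (proprietary) code: the matrix size, worker count, and the cat-based narrowing are simply how I tracked the hang down to chol2inv.

library(torch)     # the problem only appears once torch is loaded
library(parallel)

A <- crossprod(matrix(rnorm(500 * 500), 500))   # some positive definite matrix
R <- chol(A)

res <- mclapply(1:12, function(i) {
  cat("before chol2inv\n")
  out <- chol2inv(R)      # several of the forked workers never return from this call
  cat("after chol2inv\n")
  sum(out)
}, mc.cores = 12)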
Here begins the interesting part. I had no problems with this exact setup before adding torch as a dependency of the package I am developing; the problems only seem to start once torch is loaded. I then made R print the PID of each worker so that I could identify the problematic ones. Attaching gdb to one of them gave me some interesting information:
(gdb) where
#0 0x00007fe4e6522113 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1 0x00007fe4e6520dd9 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2 0x00007fe4c70ada93 in mkl_blas_dtrmm () from /home/jon/R/x86_64-pc-linux-gnu-library/3.6/torch/deps/./libtorch_cpu.so
#3 0x00007fe4c707a27b in mkl_lapack_dtrtri () from /home/jon/R/x86_64-pc-linux-gnu-library/3.6/torch/deps/./libtorch_cpu.so
#4 0x00007fe4bff10b66 in mkl_lapack_dpotri () from /home/jon/R/x86_64-pc-linux-gnu-library/3.6/torch/deps/./libtorch_cpu.so
#5 0x00007fe4bfd0f280 in mkl_lapack.dpotri_ () from /home/jon/R/x86_64-pc-linux-gnu-library/3.6/torch/deps/./libtorch_cpu.so
#6 0x00007fe4ba53a11b in ?? () from /usr/lib/R/modules//lapack.so
#7 0x00007fe4ba53b74e in ?? () from /usr/lib/R/modules//lapack.so
#8 0x00007fe4e89d4ed1 in ?? () from /usr/lib/R/lib/libR.so
#9 0x00007fe4e89e1c80 in Rf_eval () from /usr/lib/R/lib/libR.so
We can see that the call to chol2inv ended up inside a LAPACK routine from libtorch_cpu.so. The problem then appears to be an edge case where forking via mclapply breaks something inside dpotri when dpotri in turn tries to multithread.
I have tried to create a minimal example that reproduces this, with limited success, and I am unable to share the code that triggers the problem as it is proprietary. Sorry about that.
Do you have any input on the situation? Is it expected behaviour that torch replaces R's internal functions? Is it possible to have them co-exist peacefully?
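For anyone else trying to check this on their own machine, a quick (if imperfect) diagnostic is to compare what R reports about its BLAS/LAPACK before and after loading torch. Whether these queries actually reflect the interposed library depends on how and when the symbols get resolved, so the gdb backtrace above remains the most direct evidence:

extSoftVersion()["BLAS"]   # path of the BLAS that R has resolved
La_version()               # LAPACK version as reported through R's lapack module

library(torch)

extSoftVersion()["BLAS"]   # if these change, libtorch_cpu.so has interposed the symbols
La_version()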
After some further research I think the problem is related to the use of OpenMP inside the torch code. I am, however, still unsure whether it is intentional that the torch LAPACK routines override the ones built into R.
Here is a link discussing the problems that arise when forking (mclapply) is combined with OpenMP: https://bisqwit.iki.fi/story/howto/openmp/#OpenmpAndFork
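Under the assumption that this really is an OpenMP-after-fork issue, one generic mitigation (besides the torch-specific one in the next update) is to pin the OpenMP/MKL runtimes to a single thread before they are initialised:

# Must run before library(torch) and before any BLAS/LAPACK call,
# otherwise the OpenMP runtime may already have started its thread pool.
Sys.setenv(OMP_NUM_THREADS = "1", MKL_NUM_THREADS = "1")
library(torch)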
Another update on this issue: I found that if I run torch_set_num_threads(1) before running anything else, the problem disappears. I consider this a very good workaround, but I remain curious about why torch's LAPACK routines are used in place of R's built-in ones.
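In code, the workaround is simply the following (a sketch; the rest of the pipeline is unchanged):

library(torch)
torch_set_num_threads(1)   # keep libtorch's intra-op thread pool single-threaded
library(parallel)
# ... the mclapply() calls that go through chol2inv() now run to completion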
Hi @jonlachmann!
Thanks for raising the issue and for all the updates. Sorry, I'll need some time to investigate this. It's indeed unexpected that torch's LAPACK routines overwrite R's.
This could also crash the R session if R is built with a custom BLAS/LAPACK. On my side I avoid this by compiling both R and LibTorch from source, linked against Intel MKL.