Excessive memory usage during compilation with pip

Open sirmarcel opened this issue 1 year ago • 5 comments

Currently, building the sphericart-torch wheel with pip requires a large amount of RAM when many CPU cores are present. I think this is due to this line, which invokes cmake without specifying the number of jobs, so the build presumably defaults to one job per core. On an HPC system there can be 40 or 80 cores, so the compilation tends to get killed by the host OS.

While this is not catastrophic, it is inconvenient, and in many cases a waste of resources (the compilation is not much faster in parallel mode). I would suggest choosing a reasonable default number of jobs instead, or disabling parallel builds entirely. Alternatively, the installation docs should at least mention this (see #116).

sirmarcel • Apr 29 '24 17:04

Thanks for the find, this is a very good point. I'll address this in a PR tomorrow.

rubber-duck-debug • Apr 29 '24 20:04

Thanks @nickjbrowning!

sirmarcel • Apr 29 '24 21:04

One thing I don't understand here is that we don't have that many files to compile, so make -j and make -j8 should have the same behavior (launch ~8 compilation jobs).

Luthaf • Apr 30 '24 09:04

It's a bit suspicious. My observations are: (a) compilation gets killed on the default allocation on izar (4GB, I believe), (b) if you remove --parallel from the setup.py file of sphericart-torch, it works without problems, (c) requesting a node with 32GB also works, without modification.

sirmarcel • Apr 30 '24 10:04

Oh, right. I can see the compiler requiring a couple of GiB per file (there are a lot of torch headers to parse and templates to instantiate), so parallel compilation would fail with only 4GiB of available RAM. But then the change by @nickjbrowning would not fix it here, since the compilation would also fail with only 8 jobs.
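
To make the arithmetic concrete, here is a rough sketch of how one could estimate a job count that fits in memory. The function name `max_safe_jobs` and the 2 GiB per job figure are assumptions for illustration only (the "couple of GiB per file" above is an estimate, not a measurement), and nothing like this exists in sphericart as far as this thread shows.

```python
import os
from typing import Optional


def max_safe_jobs(
    available_gib: float,
    per_job_gib: float = 2.0,  # assumed peak memory per compiler process
    cpu_count: Optional[int] = None,
) -> int:
    """Estimate how many parallel compile jobs fit in the available memory.

    Hypothetical helper: caps the job count by both the number of cores
    and the number of jobs whose assumed peak memory fits in RAM.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    jobs_by_memory = int(available_gib // per_job_gib)
    # Always allow at least one job, even if the estimate says zero fit.
    return max(1, min(cpu_count, jobs_by_memory))


if __name__ == "__main__":
    # 4 GiB allocation on a 40-core node: memory, not cores, is the limit.
    print(max_safe_jobs(4, cpu_count=40))
    # 32 GiB allocation on the same node.
    print(max_safe_jobs(32, cpu_count=40))
```

Under these assumptions, a 4 GiB allocation supports about 2 jobs, so a fixed cap of 8 jobs would still need roughly 16 GiB and would still be killed, while a 32 GiB node has room for about 16.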

Luthaf • Apr 30 '24 11:04

I've added these two environment variables to the build process:

SPHERICART_PARALLEL_BUILD=ON
SPHERICART_JOBS=NJOBS

So you can now control the number of build jobs via:

SPHERICART_PARALLEL_BUILD=OFF pip install .[torch]  # disables parallel builds
SPHERICART_JOBS=4 pip install .[torch]  # uses 4 jobs for compilation
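
For readers wondering how a setup.py might translate these two variables into a cmake invocation, here is a minimal sketch. It is not the actual sphericart-torch code: the function name `cmake_build_command`, the default of ON for `SPHERICART_PARALLEL_BUILD`, and the set of values treated as "off" are all assumptions. The only external behavior relied on is that `cmake --build <dir> --parallel [<jobs>]` accepts an optional job count (CMake 3.12 and later).

```python
import os
from typing import List, Mapping, Optional


def cmake_build_command(
    build_dir: str, env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Build a `cmake --build` command line from the two variables.

    Hypothetical helper, not taken from the sphericart sources.
    """
    env = os.environ if env is None else env
    command = ["cmake", "--build", build_dir]

    # Assumption: parallel builds are on unless explicitly disabled.
    parallel = env.get("SPHERICART_PARALLEL_BUILD", "ON").upper()
    if parallel in ("OFF", "0", "FALSE", "NO"):
        # No --parallel flag: the native build tool's default applies
        # (a single job with Makefile generators).
        return command

    command.append("--parallel")
    jobs = env.get("SPHERICART_JOBS")
    if jobs is not None:
        n_jobs = int(jobs)  # raises ValueError on non-numeric input
        if n_jobs < 1:
            raise ValueError("SPHERICART_JOBS must be a positive integer")
        command.append(str(n_jobs))
    return command


if __name__ == "__main__":
    print(cmake_build_command("build", {"SPHERICART_JOBS": "4"}))
    print(cmake_build_command("build", {"SPHERICART_PARALLEL_BUILD": "OFF"}))
```

Passing the environment as a mapping keeps the function easy to test without touching `os.environ`; a real setup.py would pass the resulting list to `subprocess.check_call`.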

rubber-duck-debug • Jul 12 '24 08:07