openfold
DeepSpeed and Illegal Instruction
Hello,
I am currently trying to run training in a SLURM environment. The IT support for the cluster I am using (let's call it cluster 1) is currently working on the problem described below. Still, I would like to ask whether any of you has an idea where I might find a workaround.
If I run the code on another cluster, which seems to have fewer restrictions, I get the following stdout:
Detected CUDA files, patching ldflags
Emitting ninja build file [...]/OpenFold/lib/conda/envs/openfold_venv/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 4.999995470046997 seconds
Using [...]/OpenFold/lib/conda/envs/openfold_venv/torch_extensions as PyTorch extensions root...
Emitting ninja build file [...]/OpenFold/lib/conda/envs/openfold_venv/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 1.236398696899414 seconds
Rank: 0 partition count [1] and sizes[(4120730, False)]
Using [...]/OpenFold/lib/conda/envs/openfold_venv/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.02031540870666504 seconds
But when I try it on cluster 1, all I get is: 21721 Illegal instruction. There is no other useful error message, just the illegal instruction. The GPU used on cluster 1 is an A100 40G. In addition, I have no internet access when running code on that cluster, and it seems to be very restrictive, but I don't know what DeepSpeed does behind the scenes that makes it fail there.
Also, as far as I can tell, it cannot be a write-permission issue. I have redirected torch_extensions into the lib folder, as you can see from the output above, and I ran chmod -R 777 on that directory to rule out write-permission problems.
Does anyone have any idea on that?
Thanks in advance and have a nice day!
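One thing that may be worth checking: "Illegal instruction" is what the shell prints when a process is killed by SIGILL, meaning the CPU was asked to execute an instruction it does not support. A possible but unconfirmed cause here is that a compiled extension such as cpu_adam was built for a CPU with SIMD features that the cluster 1 compute node lacks. The following is a minimal Python sketch, assuming Linux x86 nodes that expose /proc/cpuinfo; the helper names parse_cpu_flags and simd_report are made up for illustration and are not part of OpenFold or DeepSpeed.

```python
from pathlib import Path

# SIMD feature names as they appear in the "flags" line of /proc/cpuinfo
# (Linux, x86). Which of these a given build actually needs is an assumption.
SIMD_FLAGS = ("avx", "avx2", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_report(cpuinfo_text):
    """Map each SIMD flag of interest to whether the CPU advertises it."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in SIMD_FLAGS}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(simd_report(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not found; this sketch only covers Linux")
```

Running this inside a SLURM job on the failing compute node, and again on the machine where the extensions were built, would show whether the two CPUs advertise different AVX flags. A difference would make an instruction-set mismatch more plausible, although it would not prove it.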
Sorry for the delayed response. I haven't seen this particular issue, so I'll leave this open in case anyone else has.
In any case, you can always just train without DeepSpeed: simply remove the --deepspeed_config option and everything should work fine.
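If DeepSpeed is still needed, it may be worth ruling out a stale cached build: when the torch_extensions directory sits on a shared filesystem, an extension compiled on one machine can be reloaded on a node with a different CPU. The sketch below is only an assumption about what might help, not a confirmed fix. It points TORCH_EXTENSIONS_DIR, the environment variable PyTorch's extension loader reads for its build root, at a per-host subdirectory so that each node compiles its own copy. The function names and the base path are hypothetical, and a rebuild on the compute node requires a working compiler toolchain there.

```python
import os
import platform
from pathlib import Path


def per_host_extensions_dir(base_dir, hostname=None):
    """Return base_dir/<hostname>, so each node gets its own build cache."""
    host = hostname or platform.node() or "unknown-host"
    return Path(base_dir) / host


def use_per_host_extensions_dir(base_dir, hostname=None):
    """Create the per-host directory and point TORCH_EXTENSIONS_DIR at it.

    This has to run before torch or deepspeed build or load any extension,
    because the variable is read when an extension is first loaded.
    """
    target = per_host_extensions_dir(base_dir, hostname)
    target.mkdir(parents=True, exist_ok=True)
    os.environ["TORCH_EXTENSIONS_DIR"] = str(target)
    return target
```

Calling use_per_host_extensions_dir with the existing torch_extensions path at the very top of the training script would force a fresh build on the first run on each node. Keying the directory on the hostname is a simple choice; nodes with identical CPUs would each rebuild once, which costs some time but avoids sharing binaries across different hardware.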