
DeepSpeed and Illegal Instruction

DA-L3 opened this issue 2 years ago · 1 comment

Hello,

I am currently trying to run training in a SLURM environment. The IT support of the cluster I am using (let's call it cluster 1) is looking into it, but I would still like to ask whether any of you have an idea for a workaround.

If I run the code on another cluster, which seems to have fewer restrictions, I get the following stdout:

Detected CUDA files, patching ldflags
Emitting ninja build file [...]/OpenFold/lib/conda/envs/openfold_venv/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 4.999995470046997 seconds
Using [...]/OpenFold/lib/conda/envs/openfold_venv/torch_extensions as PyTorch extensions root...
Emitting ninja build file [...]/OpenFold/lib/conda/envs/openfold_venv/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 1.236398696899414 seconds
Rank: 0 partition count [1] and sizes[(4120730, False)] 
Using [...]/OpenFold/lib/conda/envs/openfold_venv/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.02031540870666504 seconds

But when I try it on cluster 1, I get `21721 Illegal instruction`. Really, no other useful error message, just the illegal instruction. Cluster 1 uses an A100 40G GPU. I also have no internet access when running code on that cluster, and it seems to be very restrictive, but I don't know what happens behind the scenes in DeepSpeed that makes it fail. It also cannot be a write-permission issue, at least as far as I can tell: I redirected torch_extensions into the lib folder, as you can see from the messages above, and ran `chmod -R 777` on that directory to rule out permission problems.
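One possibility I cannot rule out is a SIMD mismatch: an `Illegal instruction` from a natively compiled op like `cpu_adam` can mean the binary was built with instructions (e.g. AVX2 or AVX-512) that the compute node's CPU does not support, for example if ninja compiled it once on a login node with a different CPU and the cached build was reused. A small sketch to list the feature flags the node's kernel reports (Linux/x86 assumption; `cpu_flags` is just an illustrative helper, not part of OpenFold or DeepSpeed):

```python
# Sketch: list the SIMD extensions this node's CPU reports, to compare
# against the machine where the DeepSpeed ops were compiled.
def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of CPU feature flags the kernel reports (x86 Linux)."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

if __name__ == "__main__":
    flags = cpu_flags()
    for ext in ("avx", "avx2", "avx512f"):
        print(f"{ext}: {'supported' if ext in flags else 'MISSING'}")
```

If the compute node is missing an extension that the node where the op was built does have, a cached `cpu_adam` binary could plausibly crash exactly like this.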

Does anyone have any idea on that?

Thanks in advance and have a nice day!

DA-L3 avatar Aug 25 '22 12:08 DA-L3

Sorry for the delayed response---I haven't seen this particular issue. I'll leave this open in case anyone else has.

In any case, you can always just train without DeepSpeed---simply remove the --deepspeed_config option and everything should work fine.
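If you do want to keep debugging the DeepSpeed path, one thing that might be worth trying (an assumption on my part, suggested by the "ninja: no work to do" / "No modifications detected" lines above, which indicate a cached build): clear the cached extensions so they are recompiled on the compute node that will actually run them, in case the cached binaries were built for a different CPU. `TORCH_EXTENSIONS_DIR` is the standard PyTorch variable for the JIT-extension cache location; the path below is a placeholder:

```shell
# Sketch: force PyTorch/DeepSpeed to rebuild the JIT-compiled ops on this node.
# The cache path is a placeholder; point it wherever your job can write.
export TORCH_EXTENSIONS_DIR="$PWD/torch_extensions"
rm -rf "$TORCH_EXTENSIONS_DIR/cpu_adam" "$TORCH_EXTENSIONS_DIR/utils"
# The next training run will recompile the ops for this machine's CPU.
```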

gahdritz avatar Oct 06 '22 05:10 gahdritz