ray_lightning

DataLoader multiprocessing causes jobs to fail

Open · import-antigravity opened this issue 3 years ago · 5 comments

I get a "DataLoader killed by signal" error whenever I set the DataLoader num_workers to anything other than 0 (setting it to 0 disables multiprocessing). Keeping num_workers at 0 is causing data-loading bottlenecks for me, so is there a chance this is being caused by ray_lightning? I can try to put together a script that reproduces the issue, but I suspect it's highly dependent on the specifics of the code, such as which dataset is being used.
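For reference, a minimal sketch of the kind of setup that triggers this (hypothetical, not the author's actual script; the model and synthetic dataset are stand-ins, and the RayPlugin usage follows the ray_lightning README):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray_lightning import RayPlugin


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    # num_workers > 0 enables DataLoader multiprocessing; with 0 the run completes.
    train_loader = DataLoader(dataset, batch_size=64, num_workers=4)

    plugin = RayPlugin(num_workers=1, use_gpu=True)
    trainer = pl.Trainer(max_epochs=1, plugins=[plugin])
    trainer.fit(BoringModel(), train_loader)  # DataLoader workers get killed by signal here
```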

import-antigravity · Sep 23 '21 01:09

Kind of like this

import-antigravity · Sep 23 '21 01:09

It seems like this is due to the Ray Lightning workers not having enough memory: https://discuss.pytorch.org/t/runtimeerror-dataloader-worker-pid-27351-is-killed-by-signal-killed/91457/9.

Are you still seeing the same problem if you reduce the number of workers in the plugin? How many workers are being placed on each node?
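A quick sketch of that suggestion, assuming the RayPlugin constructor arguments described in the ray_lightning README (num_workers, num_cpus_per_worker, use_gpu); the values are illustrative:

```python
from ray_lightning import RayPlugin

# Fewer Ray training workers per node leaves more host RAM for each worker's
# DataLoader subprocesses, which are the processes being killed by signal.
plugin = RayPlugin(
    num_workers=1,          # one Ray training worker per node
    num_cpus_per_worker=2,  # leave CPU headroom for DataLoader subprocesses
    use_gpu=True,
)
```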

amogkam · Sep 23 '21 18:09

I've only been trying with a maximum of 1 worker per node. The data should easily fit in GPU memory.

import-antigravity · Sep 23 '21 19:09

So I get the error, and then later on I get a CUDA OOM, but I don't know whether it's due to the DataLoader or to #62...

import-antigravity · Sep 23 '21 20:09

I get this as well

aced125 · Aug 09 '22 20:08