ray_lightning

DataLoader multiprocessing causes jobs to fail

Open · import-antigravity opened this issue 3 years ago · 5 comments

I get a "DataLoader killed by signal" error whenever I set the DataLoader num_workers to anything other than 0 (setting it to 0 disables multiprocessing). Keeping num_workers at 0 is causing data-loading bottlenecks for me, so is there a chance this is being caused by ray_lightning? I can try to put together a script that reproduces the issue, but I suspect it's highly dependent on the specifics of the code, such as which dataset is being used.
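For reference, a minimal sketch of the kind of setup that triggers this (hypothetical, not the author's actual script; the model and synthetic dataset are stand-ins, and the RayPlugin usage follows the ray_lightning README):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray_lightning import RayPlugin


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    # num_workers > 0 enables DataLoader multiprocessing; with 0 the run completes.
    train_loader = DataLoader(dataset, batch_size=64, num_workers=4)

    plugin = RayPlugin(num_workers=1, use_gpu=True)
    trainer = pl.Trainer(max_epochs=1, plugins=[plugin])
    trainer.fit(BoringModel(), train_loader)  # DataLoader workers get killed by signal here
```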

import-antigravity · Sep 23 '21 01:09

Kind of like this

import-antigravity · Sep 23 '21 01:09

It seems like this is due to the Ray Lightning workers not having enough memory: https://discuss.pytorch.org/t/runtimeerror-dataloader-worker-pid-27351-is-killed-by-signal-killed/91457/9.

Are you still seeing the same problem if you reduce the number of workers in the plugin? How many workers are being placed on each node?
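A quick sketch of that suggestion, assuming the RayPlugin constructor arguments described in the ray_lightning README (num_workers, num_cpus_per_worker, use_gpu); the values are illustrative:

```python
from ray_lightning import RayPlugin

# Fewer Ray training workers per node leaves more host RAM for each worker's
# DataLoader subprocesses, which are the processes being killed by signal.
plugin = RayPlugin(
    num_workers=1,          # one Ray training worker per node
    num_cpus_per_worker=2,  # leave CPU headroom for DataLoader subprocesses
    use_gpu=True,
)
```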

amogkam · Sep 23 '21 18:09

I've only been trying with a maximum of 1 worker per node. The data should easily fit in GPU memory.

import-antigravity · Sep 23 '21 19:09

So I get the error, and then later on I get a CUDA OOM, but I don't know whether it's due to the DataLoader or to #62...

import-antigravity · Sep 23 '21 20:09

I get this as well

aced125 · Aug 09 '22 20:08