ray_lightning
DataLoader multiprocessing causes jobs to fail
I get a "DataLoader worker killed by signal" error whenever I set the dataloader num_workers to something other than 0. Setting num_workers=0 disables multiprocessing and avoids the error, but it leaves me with a data-loading bottleneck. Is there a chance this is being caused by ray_lightning? I can try to make a script to reproduce the issue, but I think it's highly dependent on the specifics of the code, like which dataset is being used.
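A minimal sketch of the setting in question, using a hypothetical stand-in dataset (the real dataset from the report is not shown): num_workers=0 loads batches in the main process, while any value > 0 spawns that many worker subprocesses, which is where the "killed by signal" error appears.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical small in-memory dataset standing in for the real one.
dataset = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))

# num_workers=0 would disable multiprocessing and avoid the crash;
# num_workers > 0 forks worker subprocesses for faster loading.
loader = DataLoader(dataset, batch_size=16, num_workers=2)

for features, labels in loader:
    pass  # training step would go here
```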
Kind of like this
It seems like this is due to the Ray Lightning workers not having enough memory: https://discuss.pytorch.org/t/runtimeerror-dataloader-worker-pid-27351-is-killed-by-signal-killed/91457/9.
Are you still seeing the same problem if you reduce the number of workers in the plugin? How many workers are being placed on each node?
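For reference, the per-node worker count being asked about is configured on the plugin itself. A minimal sketch, assuming the `RayPlugin` API from ray_lightning (parameter names may differ across versions):

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin

# num_workers here is the number of Ray training workers (one per node in
# this report); it is separate from the DataLoader's num_workers. Reducing
# it frees memory on each node for the DataLoader subprocesses.
plugin = RayPlugin(num_workers=1, num_cpus_per_worker=2, use_gpu=True)

trainer = pl.Trainer(max_epochs=1, plugins=[plugin])
# trainer.fit(model)  # model definition omitted
```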
I've only tried with a maximum of 1 worker per node. The data should easily fit in GPU memory.
So I get the error and then later on I get a CUDA OOM, but I don't know if it's due to the dataloader or due to #62...
I get this as well