Amog Kamsetty
@rogertrullo in this case each trial runs one data-parallel training run, and that training run is itself parallelized across 2 workers. So it would look like this...
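A rough sketch of what I mean, assuming the `RayStrategy` / `get_tune_resources` names from ray_lightning (`MyLightningModule` is just a placeholder for your model):

```python
import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayStrategy            # RayPlugin in older releases
from ray_lightning.tune import get_tune_resources

def train_fn(config):
    model = MyLightningModule(lr=config["lr"])   # placeholder LightningModule
    trainer = pl.Trainer(
        max_epochs=2,
        # one data-parallel run per trial, spread over 2 Ray workers
        strategy=RayStrategy(num_workers=2, use_gpu=False),
    )
    trainer.fit(model)

# Each trial reserves the resources its 2 workers need up front, so Tune
# schedules the trial and its workers together instead of oversubscribing.
tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    resources_per_trial=get_tune_resources(num_workers=2, use_gpu=False),
)
```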
It seems like this is due to the Ray Lightning workers not having enough memory: https://discuss.pytorch.org/t/runtimeerror-dataloader-worker-pid-27351-is-killed-by-signal-killed/91457/9. Are you still seeing the same problem if you reduce the number of workers...
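Something along these lines, as a minimal sketch (`train_dataset` is a placeholder for your dataset):

```python
from torch.utils.data import DataLoader

# Fewer DataLoader worker processes means less host memory pressure;
# num_workers=0 loads data in the main process and is a good baseline.
train_loader = DataLoader(
    train_dataset,      # placeholder dataset
    batch_size=32,
    num_workers=0,      # try 0 first, then increase gradually
    pin_memory=False,   # pinned memory also adds to host RAM usage
)
```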
I was getting this error with the Ray installation from master as well. Is there anything about this in the issues on the Horovod repo?
Hi all, there is a WIP PR: https://github.com/ray-project/ray_lightning/pull/222, but unfortunately I don't have the bandwidth any time soon to pick it up. Any community contributions here are welcome!
Hey @DavidMChan, yes that's right, this is the same issue as https://github.com/ray-project/ray_lightning/issues/99. Ray Lightning does set the device type to gpu (when `use_gpu=True`), but only on the workers that actually...
Hey @Rizhiy, sorry for the late response. By GPUs not being released, do you mean that there are still Ray processes running on the GPUs?
@import-antigravity what's the issue you are seeing with #23? Can you add a comment to that issue?
@griff4692 can you please post the full stack trace for this, along with your full code if possible?
Could you also share your code, please? Thanks!
Ah, so the problem is that the `gpus` passed into `pl.Trainer` does not match the `gpus` set in `resources_per_trial`. Can you change the `gpus` passed into `pl.Trainer` to...
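For example, something like this sketch, where the two GPU counts agree (`MyLightningModule` and `train_fn` are placeholders):

```python
import pytorch_lightning as pl
from ray import tune

def train_fn(config):
    model = MyLightningModule(**config)           # placeholder LightningModule
    trainer = pl.Trainer(gpus=1, max_epochs=2)    # trainer uses exactly 1 GPU
    trainer.fit(model)

tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    resources_per_trial={"cpu": 2, "gpu": 1},     # matches gpus=1 above
)
```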