Amog Kamsetty
@rogertrullo in this case each trial runs one data-parallel training run, and that training run is itself parallelized across 2 workers. So it would look like this...
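A rough sketch of what I mean, assuming the `RayStrategy` / `get_tune_resources` names from ray_lightning (`MyLightningModule` is just a placeholder for your model):

```python
import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayStrategy            # RayPlugin in older releases
from ray_lightning.tune import get_tune_resources

def train_fn(config):
    model = MyLightningModule(lr=config["lr"])   # placeholder LightningModule
    trainer = pl.Trainer(
        max_epochs=2,
        # one data-parallel run per trial, spread over 2 Ray workers
        strategy=RayStrategy(num_workers=2, use_gpu=False),
    )
    trainer.fit(model)

# Each trial reserves the resources its 2 workers need up front, so Tune
# schedules the trial and its workers together instead of oversubscribing.
tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    resources_per_trial=get_tune_resources(num_workers=2, use_gpu=False),
)
```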
It seems like this is due to the Ray Lightning workers not having enough memory: https://discuss.pytorch.org/t/runtimeerror-dataloader-worker-pid-27351-is-killed-by-signal-killed/91457/9. Are you still seeing the same problem if you reduce the number of workers...
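Something along these lines, as a minimal sketch (`train_dataset` is a placeholder for your dataset):

```python
from torch.utils.data import DataLoader

# Fewer DataLoader worker processes means less host memory pressure;
# num_workers=0 loads data in the main process and is a good baseline.
train_loader = DataLoader(
    train_dataset,      # placeholder dataset
    batch_size=32,
    num_workers=0,      # try 0 first, then increase gradually
    pin_memory=False,   # pinned memory also adds to host RAM usage
)
```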
I was getting this error with the Ray installation from master as well. Is there anything about this in the issues on the Horovod repo?
Hi all, there is a WIP PR: https://github.com/ray-project/ray_lightning/pull/222, but unfortunately I don't have the bandwidth any time soon to pick it up. Any community contributions here are welcome!
Hey @DavidMChan, yes that's right, this is the same issue as https://github.com/ray-project/ray_lightning/issues/99. Ray Lightning does set the device type to gpu (when `use_gpu=True`), but only on the workers that actually...
Hey @Rizhiy, sorry for the late response. By GPUs not being released, do you mean that there are still Ray processes running on the GPUs?
@import-antigravity what's the issue you are seeing with #23? Can you add a comment to that issue?
@griff4692 can you please post the full stack trace for this, along with your full code if possible?
Could you also share your code, please? Thanks!
Ah, so the problem is that the `gpus` passed into `pl.Trainer` does not match the `gpus` set in `resources_per_trial`. Can you change the `gpus` passed into `pl.Trainer` to...
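For example, something like this sketch, where the two GPU counts agree (`MyLightningModule` and `train_fn` are placeholders):

```python
import pytorch_lightning as pl
from ray import tune

def train_fn(config):
    model = MyLightningModule(**config)           # placeholder LightningModule
    trainer = pl.Trainer(gpus=1, max_epochs=2)    # trainer uses exactly 1 GPU
    trainer.fit(model)

tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    resources_per_trial={"cpu": 2, "gpu": 1},     # matches gpus=1 above
)
```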