ray_lightning
PyTorch Lightning Distributed Accelerators using Ray
Hi, any plan to support Elastic Horovod? Thanks.
Hi, whenever a worker fails and Ray tries to recreate it, I receive the following error.
```
types.RayTaskError(ValueError): ray::RayExecutor.execute() (pid=10054, ip=172.31.18.173)
  File "python/ray/_raylet.pyx", line 534, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line...
```
If an error occurs during training and is later caught in the code, the GPUs are not released. I have a script that automatically relaunches any training runs that fail. Currently,...
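A minimal sketch of one possible workaround, assuming the relaunch script invokes training through a function like the hypothetical `run_training` below: shutting Ray down in a `finally` block tears down the worker processes, and with them their GPU allocations, even when the training exception itself is caught upstream. The `RayPlugin` name and arguments depend on the ray_lightning version (newer releases call it `RayStrategy`).

```python
import ray
import pytorch_lightning as pl
from ray_lightning import RayPlugin  # RayStrategy in newer ray_lightning releases


def run_training(model, datamodule):
    # Start (or reuse) Ray explicitly so this function controls its lifetime.
    ray.init(ignore_reinit_error=True)
    try:
        trainer = pl.Trainer(
            max_epochs=10,
            plugins=[RayPlugin(num_workers=2, use_gpu=True)],
        )
        trainer.fit(model, datamodule=datamodule)
    finally:
        # Tear down the Ray workers so their GPU allocations are released
        # even if the relaunch script catches the training exception.
        ray.shutdown()
```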
Is ray_lightning currently compatible with the DeepSpeed accelerator in PTL?
We should make sure that this library works on SLURM, and provide a specific example of how to use the `RayAccelerator` on a SLURM cluster, and with Tune as well (see the sketch below)...
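Pending a real example in the docs, here is a rough sketch of how the combination could look, assuming a Ray cluster has already been started across the SLURM allocation (e.g. via `ray start` in the sbatch script) and using the `RayPlugin` / `ray_lightning.tune` names from recent releases; `MyLightningModule` and the resource counts are placeholders.

```python
import ray
from ray import tune
import pytorch_lightning as pl
from ray_lightning import RayPlugin
from ray_lightning.tune import TuneReportCallback, get_tune_resources


def train_fn(config):
    # Hypothetical LightningModule; substitute your own model here.
    model = MyLightningModule(lr=config["lr"])
    trainer = pl.Trainer(
        max_epochs=5,
        # Each trial distributes training across 2 GPU workers on the cluster.
        plugins=[RayPlugin(num_workers=2, use_gpu=True)],
        # Reports the logged val_loss back to Tune after each validation run.
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(model)


# Connect to the Ray cluster launched inside the SLURM allocation.
ray.init(address="auto")

tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    # Reserve a placement group covering the plugin's workers, not just the
    # lightweight trainable process itself.
    resources_per_trial=get_tune_resources(num_workers=2, use_gpu=True),
    num_samples=4,
)
```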
I am now confused about how to set `resources_per_trial={'cpu': 4, 'gpu': 2}`. I have 2 RTX 3090s and 64 CPUs, and I am passing `pl.Trainer(..., gpus=2)`. However, I am getting...
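One way to untangle this, sketched under the assumption that the trial uses `RayPlugin` with two GPU workers: with ray_lightning, the GPUs are requested by the Ray worker actors the plugin launches, not by the Trainer process, so setting `gpus=2` on `pl.Trainer` conflicts with the hand-written `resources_per_trial` dict. The `get_tune_resources` helper lives in `ray_lightning.tune`; its `num_cpus_per_worker` argument and exact signature may vary by version.

```python
from ray_lightning import RayPlugin
from ray_lightning.tune import get_tune_resources

# GPUs are claimed by the Ray workers the plugin spawns, not by the Trainer
# process, so do not also pass gpus=2 to pl.Trainer; instead size the
# workers so they cover the 2 GPUs between them.
plugin = RayPlugin(num_workers=2, num_cpus_per_worker=4, use_gpu=True)

# Builds the matching placement-group request for Tune, replacing a
# hand-written resources_per_trial={'cpu': 4, 'gpu': 2}.
resources = get_tune_resources(num_workers=2, num_cpus_per_worker=4, use_gpu=True)
print(resources)
```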