
PyTorch Lightning Distributed Accelerators using Ray

66 ray_lightning issues

Hi, any plans to support Elastic Horovod? Thanks!

Hi, whenever a worker fails and Ray tries to recreate it, I receive the following error:

```
types.RayTaskError(ValueError): ray::RayExecutor.execute() (pid=10054, ip=172.31.18.173)
  File "python/ray/_raylet.pyx", line 534, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line...
```

If an error occurs during training and is later caught in the code, the GPUs are not released. I have a script which automatically relaunches any training runs that fail. Currently,...
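One workaround for situations like this is to tear down the Ray workers explicitly once the error has been caught, so the actors holding GPU memory are destroyed before the next run starts. A minimal sketch, assuming the relaunch script drives each run; `build_model` is a hypothetical factory, not part of the library:

```python
import ray
import pytorch_lightning as pl

def train_once():
    """One training run, driven by the relaunch script (hypothetical)."""
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(build_model())  # build_model is a hypothetical helper

try:
    train_once()
except Exception as err:
    print(f"Training failed, will relaunch: {err}")
finally:
    # Tear down all Ray actors so the GPU memory they hold is released
    # before the script launches the next run.
    ray.shutdown()
```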

Is ray_lightning currently compatible with the DeepSpeed accelerator in PTL?

We should make sure that this library works on SLURM, and provide a specific example of how to use the `RayAccelerator` on a SLURM cluster, and with Tune as well...
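A rough sketch of what such an example might look like, assuming the SLURM job script has already started a Ray cluster across the allocated nodes (e.g. via `ray start` on each node); the `RayAccelerator` import path and constructor parameters here are assumptions, not the documented API:

```python
import ray
import pytorch_lightning as pl
from ray_lightning import RayAccelerator  # import path is an assumption

# Attach to the Ray cluster the SLURM job script already started,
# rather than launching a new local one.
ray.init(address="auto")

model = MyLightningModule()  # hypothetical LightningModule
trainer = pl.Trainer(
    max_epochs=10,
    accelerator=RayAccelerator(num_workers=4, use_gpu=True),
)
trainer.fit(model)
```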

I am now confused about how to set `resources_per_trial={'cpu': 4, 'gpu': 2}`. I have 2 RTX 3090s and 64 CPUs, and I am passing `pl.Trainer(..., gpus=2)`. However, I am getting...
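One way to think about this: with ray_lightning, the Tune trial process itself does little work, and the actual training workers are separate Ray actors, so the trial has to reserve resources for those workers rather than for itself. A sketch using Tune placement groups; the bundle shapes are illustrative assumptions for a 2-GPU / 64-CPU machine, and `train_fn` is a hypothetical trainable:

```python
from ray import tune
from ray.tune import PlacementGroupFactory

# One lightweight bundle for the trial driver, plus one bundle per
# ray_lightning training worker (here: 2 workers, one per RTX 3090).
resources_per_trial = PlacementGroupFactory(
    [{"CPU": 1}] + [{"CPU": 4, "GPU": 1}] * 2
)

analysis = tune.run(
    train_fn,  # hypothetical trainable that builds the pl.Trainer
    resources_per_trial=resources_per_trial,
)
```

The driver bundle is kept small on purpose: under this setup the heavy lifting happens in the worker bundles that Ray schedules onto the GPUs.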