Amog Kamsetty
How many workers are you using (`num_workers` passed into `RayPlugin`)? On a 2 GPU machine, with 1 GPU not being utilized (see https://github.com/ray-project/ray_lightning/issues/32#issuecomment-814313347), you can run a maximum...
Looking at your code, it seems like `num_workers` is being set to the CPU count, is that correct? `num_workers = round(os.cpu_count() * 1.0)`. Wouldn't that set `num_workers` to 64?
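Something like the following is what I'd expect instead (just a rough sketch, assuming the usual `RayPlugin(num_workers=..., use_gpu=...)` usage): size `num_workers` off the GPU count rather than the CPU count.

```python
import torch
from pytorch_lightning import Trainer
from ray_lightning import RayPlugin

# Each GPU-backed worker needs its own GPU, so on a 2 GPU machine
# num_workers should track the GPU count, not os.cpu_count().
num_workers = torch.cuda.device_count()  # 2 on the machine described above

plugin = RayPlugin(num_workers=num_workers, use_gpu=True)
trainer = Trainer(plugins=[plugin])
```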
Great, glad it's working now! So this plugin is useful for general-purpose distributed training with PyTorch Lightning on either a single node or a large multi-node cluster. When used with Tune, it...
Hey @igorgad - yep, this is definitely something that we would like to do down the line, and I can add it to our roadmap. OOC, this is mainly so...
Hey @jwohlwend! If you have GPUs on the node where you are running the script from, can you set `gpus=1` in your `Trainer`? Unfortunately, PyTorch Lightning requires the driver...
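Roughly something like this (just a sketch; `num_workers=2` is a placeholder for however many GPU workers you're launching):

```python
from pytorch_lightning import Trainer
from ray_lightning import RayPlugin

# Reserve a GPU for the driver process as well, since PyTorch Lightning
# expects one on the node that launches the script.
trainer = Trainer(
    gpus=1,  # GPU for the driver
    plugins=[RayPlugin(num_workers=2, use_gpu=True)],
)
```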
That's right! Currently there's no way to use both Ray Client and multi-GPU fp16. But if there's a user request for it, I can look into supporting it :). I...
Hi @igorgad, this could be very interesting. I think we would first want to add an ElasticHorovod plugin in PyTorch Lightning, and then add an ElasticRayHorovod plugin in this library...
Hey @mrkulk, so I had started working on DDP Sharded + Ray integration in this PR https://github.com/ray-project/ray_lightning/pull/16, but this work is outdated since PyTorch Lightning had some major changes to its distributed accelerators/plugins...
FairScale integration has now been merged (#42). Check it out for low-memory distributed training!
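Usage mirrors the regular `RayPlugin`. Here's a quick sketch (assuming the sharded plugin is exposed as `RayShardedPlugin`; adjust `num_workers` and `precision` for your setup):

```python
from pytorch_lightning import Trainer
from ray_lightning import RayShardedPlugin

# Sharded DDP (FairScale) partitions optimizer state and gradients across
# workers, cutting per-GPU memory for large models.
plugin = RayShardedPlugin(num_workers=4, use_gpu=True)
trainer = Trainer(plugins=[plugin], precision=16)
```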
What's the behavior you're seeing when it stalls/times out? Is this an actual `TimeoutError`, or is it just hanging with no progress being made? It would be great if you...