Amog Kamsetty
How many workers are you using (`num_workers` passed into `RayPlugin`)? On a 2 GPU machine, with 1 GPU not being utilized (see https://github.com/ray-project/ray_lightning/issues/32#issuecomment-814313347), you can run a maximum...
Looking at your code, it seems like `num_workers` is being set to the CPU count, is that correct? `num_workers = round(os.cpu_count() * 1.0)`. Wouldn't that set `num_workers` to 64?
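Something like the following is what I'd expect instead (just a rough sketch, assuming the usual `RayPlugin(num_workers=..., use_gpu=...)` usage): size `num_workers` off the GPU count rather than the CPU count.

```python
import torch
from pytorch_lightning import Trainer
from ray_lightning import RayPlugin

# Each GPU-backed worker needs its own GPU, so on a 2 GPU machine
# num_workers should track the GPU count, not os.cpu_count().
num_workers = torch.cuda.device_count()  # 2 on the machine described above

plugin = RayPlugin(num_workers=num_workers, use_gpu=True)
trainer = Trainer(plugins=[plugin])
```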
Great, glad it's working now! So this plugin is useful for general-purpose distributed training with PyTorch Lightning on either a single node or a large multi-node cluster. When used with Tune, it...
Hey @igorgad - yep, this is definitely something that we would like to do down the line, and I can add it to our roadmap. OOC, this is mainly so...
Hey @jwohlwend! If you have GPUs on the node where you are running the script from, can you set `gpus=1` in your `Trainer`? Unfortunately, PyTorch Lightning requires the driver...
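Roughly something like this (just a sketch; `num_workers=2` is a placeholder for however many GPU workers you're launching):

```python
from pytorch_lightning import Trainer
from ray_lightning import RayPlugin

# Reserve a GPU for the driver process as well, since PyTorch Lightning
# expects one on the node that launches the script.
trainer = Trainer(
    gpus=1,  # GPU for the driver
    plugins=[RayPlugin(num_workers=2, use_gpu=True)],
)
```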
That's right! Currently there's no way to use both Ray Client and multi-GPU fp16. But if there's a user request for it, I can look into supporting it :). I...
Hi @igorgad, this could be very interesting. I think we would first want to add an ElasticHorovod plugin in PyTorch Lightning, and then add an ElasticRayHorovod plugin in this library...
Hey @mrkulk, so I had started working on DDP Sharded + Ray integration in this PR https://github.com/ray-project/ray_lightning/pull/16, but this work is outdated since PyTorch Lightning had some major changes to its distributed accelerators/plugins...
FairScale integration has now been merged (#42). Check it out for low-memory distributed training!
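Usage mirrors the regular `RayPlugin`. Here's a quick sketch (assuming the sharded plugin is exposed as `RayShardedPlugin`; adjust `num_workers` and `precision` for your setup):

```python
from pytorch_lightning import Trainer
from ray_lightning import RayShardedPlugin

# Sharded DDP (FairScale) partitions optimizer state and gradients across
# workers, cutting per-GPU memory for large models.
plugin = RayShardedPlugin(num_workers=4, use_gpu=True)
trainer = Trainer(plugins=[plugin], precision=16)
```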
What's the behavior you're seeing when it stalls/times out? Is this an actual `TimeoutError`, or is it just hanging with no progress being made? It would be great if you...