
Error when using 16 bit precision

Open jwohlwend opened this issue 4 years ago • 6 comments

Hi, thank you for the great integration of Lightning & Ray!

I found that using 16-bit precision raises the following error: pytorch_lightning.utilities.exceptions.MisconfigurationException: You have asked for native AMP on CPU, but AMP is only available on GPU

The same script works fine with 32-bit precision.

I believe this is because the number of GPUs is set only in the Ray plugin and not in the Lightning Trainer, and this check runs before the Ray plugin is used. More generally, it may be a bit risky not to set the Trainer's number of GPUs when GPUs are actually intended, since Lightning may run other internal checks that could lead to unexpected behavior such as this one.
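
For reference, here is a minimal sketch of the configuration that hits the check (constructor arguments follow the ray_lightning README; exact names may differ between versions):

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin

# GPUs are requested only through the Ray plugin, not the Trainer,
# so Lightning's pre-flight AMP check still sees a CPU-only trainer.
plugin = RayPlugin(num_workers=4, use_gpu=True)

trainer = pl.Trainer(
    precision=16,       # raises "AMP is only available on GPU"
    plugins=[plugin],   # note: gpus is not set on the Trainer
)
```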

Would appreciate any tips on getting this integration to work with half precision training, thank you!

jwohlwend avatar Oct 27 '21 14:10 jwohlwend

Hey @jwohlwend! If you have GPUs on the node where you are running the script, can you set gpus=1 in your Trainer?

Unfortunately, PyTorch Lightning requires the driver to have access to GPUs, which does not allow setups such as CPU head node + GPU workers, or running the driver script on your laptop via Ray Client. So I had to do a hack to only set GPUs on the worker, and not the driver, which is why you're seeing this error. If you also set GPUs on the driver then this should be resolved, so let me know if the above change works for you.
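
In other words, something along these lines should work (a sketch only; it assumes the node running the script has at least one GPU):

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin

plugin = RayPlugin(num_workers=4, use_gpu=True)

trainer = pl.Trainer(
    gpus=1,             # gives the driver a GPU so the AMP check passes
    precision=16,       # native AMP is now configured correctly
    plugins=[plugin],
)
```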

amogkam avatar Oct 27 '21 18:10 amogkam

Awesome! I tried it and it seems to work! Just to make sure I understand: the drawback of setting gpus=1 in the Trainer is that one can no longer use a CPU-only driver, which is what Ray Client requires? Does that imply that there is no way to use both Ray Client and multi-GPU fp16? Not that it matters for my use cases, but I am curious :)

jwohlwend avatar Oct 27 '21 19:10 jwohlwend

That's right! Currently there's no way to use both Ray Client and multi-gpu fp16. But if there's a user request for it, I can look into supporting it :). I think the implementation would just be triggering fp16 on the worker side instead of the driver side.

amogkam avatar Oct 27 '21 19:10 amogkam

Yeah makes sense, I do not require this feature so I'll close the issue here :) Thanks for the quick response!

jwohlwend avatar Oct 27 '21 19:10 jwohlwend

@amogkam it appears that the fix you proposed breaks Tune usage; I get errors of the form: you requested gpus [0] but your machine has []. Removing the gpus argument from the Trainer makes Tune work as expected.
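
Roughly, the setup looks like this (a sketch only; train_fn and the missing model are illustrative):

```python
from ray import tune
import pytorch_lightning as pl
from ray_lightning import RayPlugin

def train_fn(config):
    # The Tune trial process is not necessarily assigned a GPU, so gpus=1
    # here triggers "you requested gpus [0] but your machine has []".
    trainer = pl.Trainer(
        gpus=1,  # fine for plain training, breaks under Tune
        precision=16,
        plugins=[RayPlugin(num_workers=2, use_gpu=True)],
    )
    # trainer.fit(model)

tune.run(train_fn, num_samples=1)
```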

jwohlwend avatar Nov 01 '21 19:11 jwohlwend

I am encountering this issue too.

I am following the recommendation to configure the head node with resources: {"CPU": 0} so that no additional tasks are scheduled on it (https://docs.ray.io/en/master/cluster/guide.html#deployment-guide). The pattern of a CPU-only head node and GPU worker nodes seems very reasonable, at least for large clusters.

Also, if one has a mix of workers with accelerator_type:T4 and others with accelerator_type:V100, there is currently no way to specify where these actors should be scheduled. (But I think this should go in another issue.)
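
For context, plain Ray actors can already request a specific accelerator type, so this would mostly be about the plugin exposing that option. A sketch, assuming a recent Ray version:

```python
import ray
from ray.util.accelerators import NVIDIA_TESLA_V100

# A plain Ray actor can be pinned to nodes with a given accelerator type;
# ray_lightning currently has no way to pass this through for its workers.
@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
class GPUWorker:
    def device_name(self):
        import torch
        return torch.cuda.get_device_name(0)
```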

I am also using Ray Client.

@amogkam +1 to support CPU head + GPU workers. Thanks!

pablete avatar Nov 04 '21 00:11 pablete