ray_lightning
Cannot Use GPUStatsMonitor callback with Ray Lightning
The `GPUStatsMonitor` callback records GPU utilization metrics to TensorBoard logs; however, when running with ray_lightning, it raises a `MisconfigurationException`:

```
pytorch_lightning.utilities.exceptions.MisconfigurationException: You are using GPUStatsMonitor but are not running on GPU since gpus attribute in Trainer is set to None.
```
This is due to the code in the stats monitor callback:
```python
if trainer._device_type != DeviceType.GPU:
    raise MisconfigurationException(
        "You are using GPUStatsMonitor but are not running on GPU"
        f" since gpus attribute in Trainer is set to {trainer.gpus}."
    )
```
It appears that ray_lightning does not set the `DeviceType` to GPU, which may have other unintended consequences later on.
This may also be solved by #118, but it's not entirely clear.
Hey @DavidMChan, yes, that's right — this is the same issue as https://github.com/ray-project/ray_lightning/issues/99. Ray Lightning does set the device type to GPU (when `use_gpu=True`), but only on the workers that actually execute training. For features like mixed precision or the `GPUStatsMonitor` callback, PyTorch Lightning requires GPUs to be enabled on the driver side as well (even though they are not actually used there). If you set `gpus=1` in your Trainer, this tells PTL that the driver has a GPU available, and then this should work.
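A minimal sketch of that workaround, assuming ray_lightning's `RayPlugin` and a GPU visible on the driver node (`MyLightningModule` and `num_workers=2` are placeholders for your own module and cluster size):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import GPUStatsMonitor
from ray_lightning import RayPlugin

# MyLightningModule stands in for your own LightningModule subclass.
model = MyLightningModule()

trainer = pl.Trainer(
    # gpus=1 tells PTL the driver has a GPU, which satisfies the
    # GPUStatsMonitor check even though training runs on the workers.
    gpus=1,
    callbacks=[GPUStatsMonitor()],
    # use_gpu=True makes the Ray workers set their device type to GPU.
    plugins=[RayPlugin(num_workers=2, use_gpu=True)],
)
trainer.fit(model)
```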
Unfortunately, this gets a bit tricky when you want to use Ray Client, or when executing a script from a CPU head node with GPU worker nodes. PTL is not designed to support these types of deployments.