
Not detecting GPU within trials during HPO

yinweisu opened this issue · 5 comments

I tried to use ray_lightning + Ray Tune to do distributed HPO and found that GPUs are not available within the trial even when I set use_gpu=True in both the RayPlugin and get_tune_resources. I can see the following output in the trial: GPU available: False, used: False

A similar result can be observed with the example provided in the README, with the corresponding change to use_gpu.

I checked nvidia-smi and it looks like the GPU is indeed being used. However, the trial not being able to detect the GPU causes various issues; for example, when you call torch.load(), torch complains that no CUDA is available.
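For reference, my setup looks roughly like this (a sketch; MyLightningModule and num_workers=2 are placeholders for my actual code):

import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayPlugin
from ray_lightning.tune import get_tune_resources

def train_fn(config):
    model = MyLightningModule(config)  # placeholder for the real module
    trainer = pl.Trainer(
        max_epochs=1,
        plugins=[RayPlugin(num_workers=2, use_gpu=True)],
    )
    trainer.fit(model)

tune.run(
    train_fn,
    resources_per_trial=get_tune_resources(num_workers=2, use_gpu=True),
)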

yinweisu · Apr 13 '22 22:04

Hey @yinweisu, that's right: by default, only the training worker processes reserve and use GPUs, but the main process for each trial does not. This is because the main process does not actually do any training, so in general it does not make sense for it to reserve a GPU and not use it: see the discussion here https://github.com/ray-project/ray_lightning/issues/23.

If you do want to use a GPU for the main process, you can still do so: pass gpus=1 to your PyTorch Lightning Trainer and also manually specify resources_per_trial. You can copy the code from get_tune_resources and replace this line with {"CPU": 1, "GPU": 1}.

This will reserve 1 GPU for the main process, but since the main process does not actually do any training, in general I would recommend against doing this.
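Roughly, something like this (a sketch of hand-building the resources, assuming 2 training workers; adjust the bundle counts to match the arguments you were passing to get_tune_resources):

from ray import tune

# First bundle is for the main trial process: give it a GPU as well.
# The remaining bundles are one per training worker.
resources = tune.PlacementGroupFactory(
    [{"CPU": 1, "GPU": 1}] + [{"CPU": 1, "GPU": 1}] * 2
)

tune.run(train_fn, resources_per_trial=resources)

Then pass gpus=1 to the PyTorch Lightning Trainer inside the trainable so the main process actually uses the GPU it reserved.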

What are you calling torch.load on, and how are you calling it? The model saved by the Trainer should always be a CPU model anyway.

amogkam · Apr 13 '22 22:04

Thanks for the info. I just read through the thread. Does assigning a fractional GPU to the main process still work?

yinweisu · Apr 13 '22 23:04

Yes that would still work!
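For example, the head bundle from the sketch above could reserve a fraction of a GPU (the 0.1 is just illustrative):

from ray import tune

# Main trial process gets 0.1 of a GPU; the two worker bundles are unchanged.
resources = tune.PlacementGroupFactory(
    [{"CPU": 1, "GPU": 0.1}] + [{"CPU": 1, "GPU": 1}] * 2
)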

amogkam · Apr 13 '22 23:04

Hi @amogkam, I'm trying to collect the results of each distributed trial and do some post-processing at the end. Currently, we have logic that checks whether global_rank == 0 and then does the post-processing. However, the main process has only 1 CPU and no GPU, and we don't want to waste a GPU just to do post-processing and no training. So I tried checking global_rank == 1 instead, hoping a worker process would do the post-processing, but I found that the global_rank == 1 branch is never executed. How does the rank in ray_lightning work?

To make our structure easier to understand, here is some pseudocode:

def trainable(config):
  trainer = pl.Trainer(xxx, plugins=[RayPlugin(xxx)])
  trainer.fit()
  if trainer.global_rank == 0:  # changing this to global_rank == 1 means the branch never executes
    # post-processing

tune.run(trainable, xxx)

yinweisu · Apr 14 '22 17:04

Hey @yinweisu, the trainable function is run just once and does not actually do any training itself. Inside the trainable function, when you call trainer.fit with a Trainer that contains the RayPlugin, this launches N Ray actors, and those actors are what actually perform the data-parallel training.

Since trainable doesn't actually do training and is just run once, the global rank of the trainer inside the trainable function doesn't really make sense and will always return 0.

To actually aggregate metrics across the different training processes, you would want to implement this in your PyTorch LightningModule using the PyTorch Lightning APIs: this thread might be useful https://github.com/PyTorchLightning/pytorch-lightning/discussions/6501.
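For example, something along these lines inside your LightningModule (a rough sketch; _compute_loss and the metric names are placeholders for your own logic):

import torch
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        loss = self._compute_loss(batch)  # placeholder for your loss computation
        # sync_dist=True reduces the logged value across all training workers
        self.log("val_loss", loss, sync_dist=True)
        return loss

    def validation_epoch_end(self, outputs):
        # all_gather collects each worker's tensor, adding a leading world-size dim
        all_losses = self.all_gather(torch.stack(outputs))
        if self.trainer.is_global_zero:
            # rank 0 of the training workers can do the post-processing here
            ...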

Let me know if this makes sense and if you have any more questions!

amogkam · Apr 18 '22 20:04