
[Tune] Torch Lightning example hangs

Open AlexandreRozier opened this issue 2 years ago • 2 comments

What happened + What you expected to happen

Hi! I am trying to run your Torch Lightning example. However, the script never finishes, leaving me with one or more experiments stuck in the running state (no progress is shown in training_iterations).

I tried running this example because I also encountered this problem with another model I'm working on, and I wanted to see whether it was due to my own code or to a Ray Tune issue. Since the problem also occurs in the demo script, I'd say this is a Tune bug. I also tried killing all Ray instances / workers between runs because I had IDLE Ray processes. The issue also appears on Ray 1.13.1 and 1.12.0.

Versions / Dependencies

ray 2.0.0
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Python 3.8.13

Reproduction script

Run https://github.com/ray-project/ray/tree/master/python/ray/tune/examples/mnist_pytorch_lightning.py; it hangs indefinitely.

Issue Severity

High: It blocks me from completing my task.

AlexandreRozier avatar Sep 21 '22 14:09 AlexandreRozier

What PyTorch Lightning version do you have, @AlexandreRozier?
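
In case it helps with reporting versions, here is a minimal sketch that prints the installed versions of the relevant packages using only the standard library. The helper name `installed_version` is hypothetical, and the distribution names (`ray`, `pytorch-lightning`, `torch`) are assumed to match how the packages were installed with pip.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them if you installed from another source.
for name in ("ray", "pytorch-lightning", "torch"):
    print(name, installed_version(name) or "not installed")
```

Pasting that output into the issue should make it easier to tell which combination of versions is involved.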

amogkam avatar Sep 21 '22 18:09 amogkam

If this is with PyTorch Lightning 1.7, then it looks like this is the same as this issue: https://github.com/ray-project/ray/issues/28197.

This has been fixed in the nightly versions of Ray. Alternatively, you can do this workaround to resolve the issue for Ray 2.0 or prior:

import ray
ray.init(runtime_env={"env_vars": {"PL_DISABLE_FORK": "1"}})

Add this to the beginning of your training script.
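
If you would rather not set the variable unconditionally, one possible approach is to apply it only when a PyTorch Lightning 1.7.x release is in use. This is a rough sketch rather than anything Ray provides: `needs_fork_workaround`, `build_runtime_env`, and `init_ray` are hypothetical helper names, and the "1.7.x only" check is an assumption based on the versions mentioned in this thread.

```python
def needs_fork_workaround(ptl_version):
    """Return True for PyTorch Lightning 1.7.x (assumed to be the affected series)."""
    if ptl_version is None:
        return False
    return ptl_version.split(".")[:2] == ["1", "7"]


def build_runtime_env(ptl_version):
    """Build the runtime_env dict, adding PL_DISABLE_FORK only when it seems needed."""
    if needs_fork_workaround(ptl_version):
        return {"env_vars": {"PL_DISABLE_FORK": "1"}}
    return {}


def init_ray(ptl_version):
    """Initialize Ray with the workaround applied when appropriate."""
    import ray  # imported lazily so the helpers above work without Ray installed

    ray.init(runtime_env=build_runtime_env(ptl_version))


# Example, at the very top of the training script:
# init_ray("1.7.2")
```

Setting the variable unconditionally, as in the two-line snippet, is simpler and is what I would try first.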

amogkam avatar Sep 21 '22 18:09 amogkam

Hi, I can confirm that downgrading pytorch-lightning to 1.6.5 fixes the issue (still using Ray 2.0.0). I was testing with pytorch-lightning==1.7.2, where the error occurred. Thanks for the tip!

AlexandreRozier avatar Sep 22 '22 08:09 AlexandreRozier

Sounds good! This can still work with PTL 1.7.2 if you add this to the top of your script:

import ray
ray.init(runtime_env={"env_vars": {"PL_DISABLE_FORK": "1"}})

amogkam avatar Sep 22 '22 18:09 amogkam