[Tune] Torch Lightning example hangs
What happened + What you expected to happen
Hi! I am trying to run your Torch Lightning example.
However, the script never finishes, leaving me with one or more experiments still running (no progress is shown in training_iterations).
I tried running this example because I also hit this problem with another model I'm working on, and I wanted to see whether it was due to my own code or to a Ray Tune issue. Since the problem also occurs with the demo script, I'd say this is a Tune bug. I also tried killing all Ray instances / workers between runs because I had IDLE Ray processes. The issue also appears on Ray 1.13.1 and 1.12.0.
Versions / Dependencies
ray 2.0.0
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Python 3.8.13
Reproduction script
Run https://github.com/ray-project/ray/tree/master/python/ray/tune/examples/mnist_pytorch_lightning.py; it hangs indefinitely.
Issue Severity
High: It blocks me from completing my task.
What PyTorch Lightning version do you have, @AlexandreRozier?
If this is with PyTorch Lightning 1.7, then it looks like this is the same issue as https://github.com/ray-project/ray/issues/28197.
This has been fixed in the nightly versions of Ray. Alternatively, you can do this workaround to resolve the issue for Ray 2.0 or prior:
import ray
ray.init(runtime_env={"env_vars": {"PL_DISABLE_FORK": "1"}})
Add this to the beginning of your training script.
Hi, I can confirm that downgrading torch-lightning to 1.6.5 fixes the issue (still using torch 2.0.0). I was testing with torch-lightning==1.7.2, where the error occurred. Thanks for the tip!
Sounds good! This can still work with PTL 1.7.2 if you add this to the top of your script:
import ray
ray.init(runtime_env={"env_vars": {"PL_DISABLE_FORK": "1"}})
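For scripts that have to run against several PyTorch Lightning versions, the workaround above could be applied conditionally. The following is only a rough sketch: the helper names needs_fork_workaround and build_runtime_env are hypothetical, and treating every release from 1.7 onward as affected is a conservative assumption (this thread only confirms that 1.7.2 hangs and 1.6.5 does not).

```python
from typing import Optional


def needs_fork_workaround(ptl_version: str) -> bool:
    """Return True if the given PyTorch Lightning version is 1.7 or later.

    Assumption: only 1.7.2 is confirmed to hang in this thread; later
    releases are included here out of caution, not from evidence.
    """
    major, minor = (int(part) for part in ptl_version.split(".")[:2])
    return (major, minor) >= (1, 7)


def build_runtime_env(ptl_version: str) -> Optional[dict]:
    """Build the runtime_env argument for ray.init (hypothetical helper).

    Returns the PL_DISABLE_FORK setting quoted in this thread when the
    workaround is needed, and None (Ray's default) otherwise.
    """
    if needs_fork_workaround(ptl_version):
        return {"env_vars": {"PL_DISABLE_FORK": "1"}}
    return None


# Intended use at the top of a training script (not executed here):
#   import ray
#   import pytorch_lightning as pl
#   ray.init(runtime_env=build_runtime_env(pl.__version__))
```

Setting the variable unconditionally, as in the snippets above, is simpler; the version check only helps if you want to leave the environment untouched on releases that were not affected.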