
[Core] ray.init() overrides SIGINT handler and causes an error in torch.compile

Open · mritterfigma opened this issue 6 months ago · 2 comments

What happened + What you expected to happen

Upon startup, I noticed a weird stack trace. It comes from a conflict between two things:

  1. When Ray starts up, it applies a "signal monkey patch" that blocks any attempt to set a SIGINT handler while SIGINT is being deferred (code).
  2. When torch kicks off asynchronous compilation, each compile worker's initializer installs a no-op SIGINT handler (signal.SIG_IGN) to suppress noisy interrupt logs (code). A minimal sketch of the resulting conflict follows this list.
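
For illustration, the conflict can be reproduced in isolation. This is a minimal, untested sketch that relies on Ray's private DeferSigint context manager from ray/_private/utils.py (an internal API, so details may differ between Ray versions), run on the main thread:

import signal
import ray._private.utils as ray_utils

# While Ray is deferring SIGINT (as it does internally around task
# submission), its monkey-patched signal.signal rejects any new SIGINT
# handler on the main thread. Installing SIG_IGN is exactly what torch's
# compile-worker initializer does.
with ray_utils.DeferSigint():
    # Raises: ValueError: Can't set signal handler for SIGINT while
    # SIGINT is being deferred within a DeferSigint context.
    signal.signal(signal.SIGINT, signal.SIG_IGN)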

One way to fix this is to turn off asynchronous torch compilation by setting TORCHINDUCTOR_COMPILE_THREADS to 1 (code). I verified this empirically, but forcing torch compilation to be synchronous doesn't seem like a good trade-off.
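
Concretely, the workaround looks like this; to be safe, the variable should be set before the first torch import so the inductor config picks it up:

import os

# Workaround: force synchronous inductor compilation so torch never
# spawns the compile-worker pool whose initializer sets a SIGINT handler.
os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "1"

import torch  # imported only after the env var is set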

Is it possible to keep the torch compilation thread pool and avoid this exception when using Ray?
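
One untested direction I can imagine (my own speculation, not an API that Ray or torch provides): wrap the already-patched signal.signal so a rejected SIGINT handler degrades to a no-op instead of crashing the compile-worker initializer:

import signal

# Capture whatever signal.signal currently is (Ray's patched version
# once ray.init() has run).
_patched_signal = signal.signal

def _tolerant_signal(signum, handler):
    try:
        return _patched_signal(signum, handler)
    except ValueError:
        # Ray refused the handler while deferring SIGINT; proceed without
        # it. Caveat: this also swallows unrelated ValueErrors (e.g.
        # setting a handler from a non-main thread), so it is a blunt tool.
        return None

signal.signal = _tolerant_signal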

(RayTrainWorker pid=6243) Exception in initializer: [repeated 247x across cluster]
(RayTrainWorker pid=6243) Traceback (most recent call last): [repeated 247x across cluster]
(RayTrainWorker pid=6243)   File "/opt/conda/lib/python3.11/concurrent/futures/process.py", line 240, in _process_worker [repeated 247x across cluster]
(RayTrainWorker pid=6243)     initializer(*initargs) [repeated 247x across cluster]
(RayTrainWorker pid=6243)   File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2554, in _async_compile_initializer [repeated 247x across cluster]
(RayTrainWorker pid=6243)     signal.signal(signal.SIGINT, signal.SIG_IGN) [repeated 247x across cluster]
(RayTrainWorker pid=6243)   File "/opt/conda/lib/python3.11/site-packages/ray/_private/utils.py", line 1879, in _signal_monkey_patch [repeated 247x across cluster]
(RayTrainWorker pid=6243)     raise ValueError( [repeated 247x across cluster]
(RayTrainWorker pid=6243) ValueError: Can't set signal handler for SIGINT while SIGINT is being deferred within a DeferSigint context. [repeated 247x across cluster]

Versions / Dependencies

PyTorch 2.3.0, Ray 2.32.0

Reproduction script

import ray
import ray.train.torch
from ray.train import RunConfig, ScalingConfig

# WARNING: I have not directly tested this reduced script.

def do_my_training():
    # Placeholder for the real training loop, which calls torch.compile;
    # its async compile workers are what trigger the SIGINT error.
    pass

def train_func(config):
    do_my_training()

def train_model():
    num_gpus = int(ray.available_resources().get("GPU", 0))
    scaling_config = ScalingConfig(num_workers=num_gpus, use_gpu=True)

    train_func_config = {}  # reduced; the real script passes actual hyperparameters
    run_config = RunConfig(name="some_name")
    trainer = ray.train.torch.TorchTrainer(
        train_func,
        train_loop_config=train_func_config,
        scaling_config=scaling_config,
        run_config=run_config,
    )
    trainer.fit()

ray.init()
train_model()
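
If forcing synchronous compilation is acceptable, the same workaround can be pushed to every worker from the driver via Ray's documented runtime_env mechanism (a sketch):

import ray

# Propagate the env var to all workers so each RayTrainWorker compiles
# synchronously instead of spawning the SIGINT-handling worker pool.
ray.init(runtime_env={"env_vars": {"TORCHINDUCTOR_COMPILE_THREADS": "1"}})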

Issue Severity

Low: It annoys or frustrates me.

mritterfigma · Aug 07 '24 02:08