[Core] ray.init() overrides SIGINT handler and causes an error in torch.compile
What happened + What you expected to happen
Upon startup, I noticed a strange stack trace. It comes from a conflict between two behaviors:
- When Ray starts up, it applies a "signal monkey patch" that prevents a SIGINT handler from being set (code).
- When torch initiates compilation, it tries to install its own no-op SIGINT handler to avoid annoying output logs (code). See the sketch below.
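Below is a simplified, illustrative sketch of the conflict. The names mirror the traceback further down; this is not Ray's or torch's actual code, just a stand-in showing how the two pieces collide:

import signal

# Stand-in for Ray's monkey patch: while SIGINT delivery is deferred,
# any attempt to re-register a SIGINT handler raises a ValueError.
_original_signal = signal.signal
_sigint_deferred = True  # in Ray this is governed by a DeferSigint context

def _signal_monkey_patch(signum, handler):
    if signum == signal.SIGINT and _sigint_deferred:
        raise ValueError(
            "Can't set signal handler for SIGINT while SIGINT is being "
            "deferred within a DeferSigint context."
        )
    return _original_signal(signum, handler)

signal.signal = _signal_monkey_patch

# Stand-in for torch's _async_compile_initializer, which installs a
# no-op SIGINT handler in each compile worker:
try:
    signal.signal(signal.SIGINT, signal.SIG_IGN)
except ValueError as e:
    print(f"ValueError: {e}")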
One way to fix this is to turn off asynchronous torch compilation by setting TORCHINDUCTOR_COMPILE_THREADS to 1 (code). I verified this empirically, but forcing torch compilation to be synchronous does not seem like a good trade-off.
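For reference, a minimal sketch of that workaround (my assumption is that the variable must be set before torch is imported, since Inductor reads its configuration from the environment at import time):

import os

# Force Inductor to compile synchronously, so no worker pool is spawned
# and no SIGINT handler is registered. Set this before importing torch.
os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "1"

import torch  # noqa: E402

The downside is losing parallel compilation, hence my question.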
Is it possible to keep the torch compilation thread pool and avoid this exception when using Ray?
(RayTrainWorker pid=6243) Exception in initializer: [repeated 247x across cluster]
(RayTrainWorker pid=6243) Traceback (most recent call last): [repeated 247x across cluster]
(RayTrainWorker pid=6243)   File "/opt/conda/lib/python3.11/concurrent/futures/process.py", line 240, in _process_worker [repeated 247x across cluster]
(RayTrainWorker pid=6243)     initializer(*initargs) [repeated 247x across cluster]
(RayTrainWorker pid=6243)   File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2554, in _async_compile_initializer [repeated 247x across cluster]
(RayTrainWorker pid=6243)     signal.signal(signal.SIGINT, signal.SIG_IGN) [repeated 247x across cluster]
(RayTrainWorker pid=6243)   File "/opt/conda/lib/python3.11/site-packages/ray/_private/utils.py", line 1879, in _signal_monkey_patch [repeated 247x across cluster]
(RayTrainWorker pid=6243)     raise ValueError( [repeated 247x across cluster]
(RayTrainWorker pid=6243) ValueError: Can't set signal handler for SIGINT while SIGINT is being deferred within a DeferSigint context. [repeated 247x across cluster]
Versions / Dependencies
PyTorch 2.3.0, Ray 2.32.0
Reproduction script
import ray
import ray.train.torch
from ray.train import RunConfig, ScalingConfig

# WARNING: I have not directly tested this reduced script.

def train_func(config):
    do_my_training()  # stand-in for the actual training loop, which calls torch.compile

def train_model():
    # Use one worker per available GPU.
    num_workers = int(ray.available_resources().get("GPU", 0))
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=True)
    run_config = RunConfig(name="some_name")
    trainer = ray.train.torch.TorchTrainer(
        train_func,
        train_loop_config={},  # the original script passed a config dict here
        scaling_config=scaling_config,
        run_config=run_config,
    )
    trainer.fit()

ray.init()
train_model()
Issue Severity
Low: It annoys or frustrates me.