Interrupted System call when doing multi-GPU training
**Environment**
- 8x A100 GPUs (the issue also occurs on other GPU types)
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
**To reproduce**
Steps to reproduce the behavior:
- Run multi-node training with a large number of GPUs
- ???
**Expected behavior**
Training starts and does not crash with "Interrupted system call".
**Additional context**
File "train_mae_2d.py", line 120, in train
run_trainer(
File "train_mae_2d.py", line 41, in run_trainer
trainer = make_trainer(
File "/home/ubuntu/video-recommendation/trainer/trainer.py", line 78, in make_trainer
return Trainer(
File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/composer/trainer/trainer.py", line 781, in __init__
dist.initialize_dist(self._device, datetime.timedelta(seconds=dist_timeout))
File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/composer/utils/dist.py", line 433, in initialize_dist
dist.init_process_group(device.dist_backend, timeout=timeout)
File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 257, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 188, in _create_c10d_store
return TCPStore(
RuntimeError: Interrupted system call
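For reference, here is a minimal sketch (not from the original report) that exercises the same `env://` rendezvous path the traceback fails in, i.e. `torch.distributed.init_process_group` building a `TCPStore`. The `MASTER_ADDR`/`MASTER_PORT` values and the script name are placeholders; if the same "Interrupted system call" appears here, the failure is below Composer in `torch.distributed`.

```python
# Hedged minimal sketch: hit the same env:// rendezvous path as the traceback.
# MASTER_ADDR/MASTER_PORT below are placeholder values.
import datetime
import os

import torch.distributed as dist


def init_distributed(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(
        backend="nccl",               # GPU backend, as selected by the Trainer
        init_method="env://",         # same rendezvous handler as in the traceback
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(seconds=300),
    )


if __name__ == "__main__":
    # e.g. launched with: torchrun --nproc_per_node=8 this_script.py
    init_distributed(rank=int(os.environ["RANK"]),
                     world_size=int(os.environ["WORLD_SIZE"]))
```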
Thanks @vedantroy. Is there a particular model you are training when you encounter this issue? And do you also observe it when running one of our standard benchmarks (e.g. ResNet-50)? Any additional information to help us reproduce would be appreciated.
@hanlint Would it help if I provide a GitHub repository + a Dockerfile? I've also filed an issue here: https://github.com/pytorch/pytorch/issues/83824 since it might be a PyTorch issue.
@hanlint Also, to be clear, I can reliably reproduce this issue when training with multiple GPUs. It is somewhat inconsistent with 2 GPUs, but it happens every time with >= 6.
I suspect it is some form of race condition that I don't understand.
OK, additional details: the error is happening because my process receives a SIGCHLD signal, which interrupts the system call. I can work around the error by sleeping before launching the trainer (a sketch of the workaround is below).
Unclear if this is a Mosaic or PyTorch bug.
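For anyone hitting the same error, a minimal sketch of the delay-based workaround described above. `init_dist_with_retry`, the 5-second delay, and the retry-on-"Interrupted system call" guard are illustrative additions, not the original poster's code, and whether pre-initializing the process group interacts cleanly with Composer's own `initialize_dist` call is not verified here.

```python
# Hedged sketch of the workaround: sleep briefly so the launcher's SIGCHLD is
# delivered before torch.distributed starts creating its TCPStore. The delay
# and retry count are arbitrary choices, not values from the original report.
import time

import torch.distributed as dist


def init_dist_with_retry(backend: str = "nccl", retries: int = 3) -> None:
    time.sleep(5)  # let any just-forked launcher children settle first
    for attempt in range(retries):
        try:
            dist.init_process_group(backend=backend)
            return
        except RuntimeError as err:
            # EINTR surfaces as "Interrupted system call"; retry a few times.
            if "Interrupted system call" not in str(err) or attempt == retries - 1:
                raise
            time.sleep(1)
```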
@vedantroy Can you please share a reproducible script? A GitHub repository + a Dockerfile would work, plus the steps for how you are launching the run. Thanks!
Closing since there was no follow-up. Please feel free to reopen!