
Interrupted System call when doing multi-GPU training

vedantroy opened this issue 2 years ago · 5 comments

**Environment**

  • 8x A100s (although this also happens on other GPU types)
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal

**To reproduce**

Steps to reproduce the behavior:

  1. Do multi-node training with a lot of GPUs
  2. ???

**Expected behavior**

Training starts without crashing with "Interrupted system call".

**Additional context**

  File "train_mae_2d.py", line 120, in train
    run_trainer(

  File "train_mae_2d.py", line 41, in run_trainer
    trainer = make_trainer(

  File "/home/ubuntu/video-recommendation/trainer/trainer.py", line 78, in make_trainer
    return Trainer(

  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/composer/trainer/trainer.py", line 781, in __init__
    dist.initialize_dist(self._device, datetime.timedelta(seconds=dist_timeout))

  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/composer/utils/dist.py", line 433, in initialize_dist
    dist.init_process_group(device.dist_backend, timeout=timeout)

  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)

  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 257, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)

  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 188, in _create_c10d_store
    return TCPStore(

RuntimeError: Interrupted system call
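For reference, the interrupted call is the `TCPStore` connect inside the `env://` rendezvous, so the EINTR surfaces as a `RuntimeError` from `torch.distributed.init_process_group`. Below is a minimal, hypothetical sketch of that failing path isolated from Composer, with an arbitrary retry-on-EINTR loop added; it assumes the launcher has already exported `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE`, and the backoff numbers are guesses rather than tuned values.

```python
# Hypothetical sketch, not the Composer code path: retry init_process_group
# when the TCPStore connect is interrupted by a signal (EINTR).
import datetime
import time

import torch.distributed as torch_dist


def init_with_retry(backend: str = "nccl", timeout_s: int = 300, attempts: int = 5) -> None:
    # Assumes MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are set (env:// rendezvous).
    for attempt in range(1, attempts + 1):
        try:
            torch_dist.init_process_group(backend, timeout=datetime.timedelta(seconds=timeout_s))
            return
        except RuntimeError as err:
            # EINTR surfaces as "Interrupted system call" from the TCPStore constructor.
            if "Interrupted system call" not in str(err) or attempt == attempts:
                raise
            time.sleep(2 * attempt)  # arbitrary backoff before retrying
```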

vedantroy · Aug 21 '22

Thanks @vedantroy. Is there a particular model you are training when you encounter this issue? And do you also see it when running some of our standard benchmarks (e.g., ResNet-50)? Any additional details to help us reproduce would be appreciated.

hanlint · Aug 21 '22

@hanlint If I provide a GitHub repository + a Dockerfile, would that be helpful? I've also filed an issue at https://github.com/pytorch/pytorch/issues/83824, since it might be a PyTorch issue.

vedantroy · Aug 22 '22

@hanlint Also, to be clear: I can reliably reproduce this issue when training with multiple GPUs. It is somewhat inconsistent with 2 GPUs, but it happens every time with >= 6.

I suspect it is some form of race condition that I don't understand.

vedantroy · Aug 22 '22

OK, additional details: the error happens because my process receives a SIGCHLD signal, which interrupts the system call. I can work around the error by sleeping before launching the trainer.

Unclear whether this is a Mosaic (Composer) or PyTorch bug.
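A minimal sketch of that workaround, in case it helps others hitting this: `make_trainer` is the factory name from the traceback above (passed in here as a callable so the snippet stays self-contained), and the base sleep plus per-rank stagger are arbitrary guesses rather than tuned values.

```python
# Hypothetical workaround sketch: delay Trainer construction so the SIGCHLD
# from the launcher's child processes arrives before the blocking TCPStore
# connect inside dist.initialize_dist(), instead of interrupting it.
import os
import time


def run_trainer_with_stagger(make_trainer_fn, stagger_s: float = 5.0):
    # make_trainer_fn stands in for the make_trainer() factory from the traceback.
    rank = int(os.environ.get("RANK", "0"))
    time.sleep(stagger_s + 0.5 * rank)  # arbitrary base delay plus per-rank stagger
    trainer = make_trainer_fn()         # Trainer.__init__ runs dist.initialize_dist() here
    trainer.fit()
    return trainer
```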

vedantroy · Aug 22 '22

@vedantroy Can you please share a reproducible script? A GitHub repository + a Dockerfile would work, plus the steps for how you are launching the run. Thanks!

karan6181 · Oct 03 '22

Closing since there was no follow-up. Please feel free to reopen!

mvpatel2000 · Jun 22 '23