
torchrun error when generating training split

Open OswaldHe opened this issue 6 months ago • 3 comments

When I try to run run/train.sh for OPT-2.7b, it generates the first 5813 samples of the training split, then exits immediately without any error log.

Generating train split:   7%|▋         | 5813/81380 [00:35<03:31, 357.02 examples/s]
E0731 23:14:13.108000 140299780256832 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 488431) of binary: /home/oswaldhe/miniconda3/envs/autocompressor/bin/python
Traceback (most recent call last):
  File "/home/oswaldhe/miniconda3/envs/autocompressor/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I'm running on an NVIDIA A100 40GB PCIe. What could be the possible issue? Thank you.
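For what it's worth, `exitcode: -9` in the torchrun log means the worker was terminated by SIGKILL (signal 9), which on Linux is most often the kernel OOM killer reaping a process during memory-heavy dataset preprocessing. A small sketch confirming the sign convention (subprocess return codes are negated signal numbers in Python, matching how torchrun reports them):

```python
import signal
import subprocess

# Simulate a child process killed by SIGKILL, as the OOM killer would do.
# The shell kills itself with signal 9; Python reports this as returncode -9,
# the same "-9" shown in the torchrun error above.
proc = subprocess.run(["sh", "-c", "kill -9 $$"])
print(proc.returncode)  # -9
assert proc.returncode == -signal.SIGKILL
```

If that is the cause here, checking `dmesg` for OOM-killer messages around the crash timestamp would confirm it.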

OswaldHe · Aug 01 '24 06:08