AutoCompressors
torchrun error when generating training split
When I try to run run/train.sh for OPT-2.7b, it generates the training split for the first 5813 samples, then exits immediately without any useful error log.
Generating train split:   7%|▋         | 5813/81380 [00:35<03:31, 357.02 examples/s]
E0731 23:14:13.108000 140299780256832 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 488431) of binary: /home/oswaldhe/miniconda3/envs/autocompressor/bin/python
Traceback (most recent call last):
File "/home/oswaldhe/miniconda3/envs/autocompressor/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I'm running on an NVIDIA A100 40GB PCIe. What could be causing this? Thank you.