
pytorch:2.0.0 DDP training fails, but the older version works

Open · alicera opened this issue 2 years ago • 1 comment

Your issue may already be reported! Please search on the issue tracker before creating one.

Context

  • Pytorch version:
  • Operating System and version: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

Your Environment

  • Installed using source? [yes/no]:
  • Are you planning to deploy it using docker container? [yes/no]:
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using:
  • Link to code or data to repro [if any]: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

Expected Behavior

The DDP example at https://github.com/pytorch/examples/blob/main/distributed/ddp/main.py should run successfully.

Current Behavior

torchrun --nproc-per-node 4 train.py 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 108) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.0', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
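A note on reading the log above: the launcher reports `exitcode: -7`, and Python's process APIs conventionally use a negative exit code to mean the child was terminated by a signal with that number. The sketch below decodes such a value with the standard `signal` module. The helper name `describe_exit_code` is hypothetical and is not part of torchrun. On Linux, signal 7 is normally SIGBUS, but the numbering is platform-dependent, and this only names the signal; it does not identify the root cause. In containers, one thing that may be worth checking for a bus error is the size of shared memory (`/dev/shm`), though that is an assumption and is not confirmed by this report.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Turn a child-process exit code into a readable description.

    Hypothetical helper for illustration only. A negative value is
    interpreted as "terminated by signal -exitcode".
    """
    if exitcode < 0:
        signum = -exitcode
        try:
            # signal.Signals maps a signal number to its name on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    # The value reported for local_rank 0 in the log above.
    print(describe_exit_code(-7))
```

Running this inside the same container image as the failing job gives the signal name as that platform defines it.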

Possible Solution

Running the same torchrun command in the docker.io/pytorch/pytorch:2110 environment succeeds.

https://github.com/pytorch/examples/blob/main/distributed/ddp/main.py

Steps to Reproduce

...

Failure Logs [if any]

alicera · Apr 30 '23 01:04

cc @rohan-varma @mrshenli

msaroufim · May 01 '23 17:05