[BUG]: auto_parallel example failed with 2x3060 on the same node (Error: The new group's rank should be within the the world_size set by init_process_group)

Open · captainst opened this issue 2 years ago · 6 comments

πŸ› Describe the bug

The hardware/software setup is described in the "Environment" section below.

I ran the command "colossalai run --nproc_per_node 2 auto_parallel_with_resnet.py", since I have 2 x 3060 cards installed and working properly.

The error log (formatting adjusted, hopefully not overwhelming):

The error message of interest: "The new group's rank should be within the the world_size set by init_process_group"

/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
    registered at aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: Meta
  previous kernel: registered at ../aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
       new kernel: registered at /dev/null:228 (Triggered internally at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
  self.m.impl(name, dispatch_key, fn)
WARNING:torch.distributed.run:
*
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*
/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
    registered at aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: Meta
  previous kernel: registered at ../aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
       new kernel: registered at /dev/null:228 (Triggered internally at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
  self.m.impl(name, dispatch_key, fn)
/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
    registered at aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: Meta
  previous kernel: registered at ../aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
       new kernel: registered at /dev/null:228 (Triggered internally at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
  self.m.impl(name, dispatch_key, fn)
/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/auto_parallel/tensor_shard/solver/solver.py:20: UserWarning: please install the pulp
  warnings.warn(f'please install the pulp')
/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/auto_parallel/tensor_shard/solver/solver.py:20: UserWarning: please install the pulp
  warnings.warn(f'please install the pulp')
[02/26/23 21:47:23] INFO     colossalai - colossalai - INFO:
                             /home/chen/anaconda3/envs/colossalai/lib/python3.9/
                             site-packages/colossalai/context/parallel_context.p
                             y:521 set_device
                    INFO     colossalai - colossalai - INFO: process rank 0 is
                             bound to device 0
[02/26/23 21:47:23] INFO     colossalai - colossalai - INFO:
                             /home/chen/anaconda3/envs/colossalai/lib/python3.9/
                             site-packages/colossalai/context/parallel_context.p
                             y:521 set_device
                    INFO     colossalai - colossalai - INFO: process rank 1 is
                             bound to device 1
[02/26/23 21:47:24] INFO     colossalai - colossalai - INFO:
                             /home/chen/anaconda3/envs/colossalai/lib/python3.9/
                             site-packages/colossalai/context/parallel_context.p
                             y:557 set_seed
[02/26/23 21:47:24] INFO     colossalai - colossalai - INFO:
                             /home/chen/anaconda3/envs/colossalai/lib/python3.9/
                             site-packages/colossalai/context/parallel_context.p
                             y:557 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on
                             rank 0, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR:
                             1024,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: initialized seed on
                             rank 1, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR:
                             1024,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO:
                             /home/chen/anaconda3/envs/colossalai/lib/python3.9/
                             site-packages/colossalai/initialize.py:116 launch
                    INFO     colossalai - colossalai - INFO: Distributed
                             environment is initialized, data parallel size: 2,
                             pipeline parallel size: 1, tensor parallel size: 1
Traceback (most recent call last):
  File "/mnt/data/colossalai/ColossalAI/examples/tutorial/auto_parallel/auto_parallel_with_resnet.py", line 95, in <module>
    main()
  File "/mnt/data/colossalai/ColossalAI/examples/tutorial/auto_parallel/auto_parallel_with_resnet.py", line 28, in main
    device_mesh = DeviceMesh(physical_mesh_id=torch.tensor([0, 1, 2, 3]), mesh_shape=[2, 2], init_process_group=True)
  File "/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/device/device_mesh.py", line 61, in __init__
    self.process_groups_dict = self.create_process_groups_for_logical_mesh()
  File "/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/device/device_mesh.py", line 132, in create_process_groups_for_logical_mesh
    process_group_handler = dist.new_group(process_group)
  File "/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3321, in new_group
    raise RuntimeError(
RuntimeError: The new group's rank should be within the the world_size set by init_process_group
Traceback (most recent call last):
  File "/mnt/data/colossalai/ColossalAI/examples/tutorial/auto_parallel/auto_parallel_with_resnet.py", line 95, in <module>
    main()
  File "/mnt/data/colossalai/ColossalAI/examples/tutorial/auto_parallel/auto_parallel_with_resnet.py", line 28, in main
    device_mesh = DeviceMesh(physical_mesh_id=torch.tensor([0, 1, 2, 3]), mesh_shape=[2, 2], init_process_group=True)
  File "/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/device/device_mesh.py", line 61, in __init__
    self.process_groups_dict = self.create_process_groups_for_logical_mesh()
  File "/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/device/device_mesh.py", line 132, in create_process_groups_for_logical_mesh
    process_group_handler = dist.new_group(process_group)
  File "/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3321, in new_group
    raise RuntimeError(
RuntimeError: The new group's rank should be within the the world_size set by init_process_group
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3169) of binary: /home/chen/anaconda3/envs/colossalai/bin/python
Traceback (most recent call last):
  File "/home/chen/anaconda3/envs/colossalai/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/chen/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
auto_parallel_with_resnet.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-02-26_21:47:27
  host      : chen-B350GT3
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3170)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-26_21:47:27
  host      : chen-B350GT3
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3169)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Error: failed to run torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job auto_parallel_with_resnet.py on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /mnt/data/colossalai/ColossalAI/examples/tutorial/auto_parallel && export SHELL="/bin/bash" CONDA_EXE="/home/chen/anaconda3/bin/conda" LC_ADDRESS="zh_CN.UTF-8" LC_NAME="zh_CN.UTF-8" LC_MONETARY="zh_CN.UTF-8" PWD="/mnt/data/colossalai/ColossalAI/examples/tutorial/auto_parallel" LOGNAME="chen" XDG_SESSION_TYPE="tty" CONDA_PREFIX="/home/chen/anaconda3/envs/colossalai" MOTD_SHOWN="pam" HOME="/home/chen" LC_PAPER="zh_CN.UTF-8" LANG="en_US.UTF-8" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:" CONDA_PROMPT_MODIFIER="(colossalai) " SSH_CONNECTION="10.71.207.213 21376 10.71.207.134 22" LESSCLOSE="/usr/bin/lesspipe %s %s" XDG_SESSION_CLASS="user" LC_IDENTIFICATION="zh_CN.UTF-8" TERM="xterm" LESSOPEN="| /usr/bin/lesspipe %s" USER="chen" CONDA_SHLVL="2" SHLVL="1" LC_TELEPHONE="zh_CN.UTF-8" LC_MEASUREMENT="zh_CN.UTF-8" XDG_SESSION_ID="5" CONDA_PYTHON_EXE="/home/chen/anaconda3/bin/python" LD_LIBRARY_PATH="/usr/local/cuda/lib64" XDG_RUNTIME_DIR="/run/user/1000" SSH_CLIENT="10.71.207.213 21376 22" CONDA_DEFAULT_ENV="colossalai" LC_TIME="zh_CN.UTF-8" XDG_DATA_DIRS="/usr/local/share:/usr/share:/var/lib/snapd/desktop" PATH="/home/chen/anaconda3/envs/colossalai/bin:/home/chen/anaconda3/condabin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin" DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/1000/bus" SSH_TTY="/dev/pts/0" CONDA_PREFIX_1="/home/chen/anaconda3" LC_NUMERIC="zh_CN.UTF-8" _="/home/chen/anaconda3/envs/colossalai/bin/colossalai" OLDPWD="/mnt/data/colossalai/ColossalAI/examples/tutorial" && torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job auto_parallel_with_resnet.py'

Exit code: 1

Stdout: already printed

Stderr: already printed

Environment

Hardware: AMD Ryzen 9 3950X 16-Core Processor, 2 x RTX 3060 (12 GB), 64 GB system memory

Software: CUDA 11.7, cuDNN 8.5.0.96, PyTorch torch-1.13.1+cu117-cp39-cp39-linux_x86_64, torchvision-0.14.1+cu117-cp39-cp39-linux_x86_64
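
For what it's worth, the message seems to come straight from the rank check inside torch.distributed's new_group. Below is a minimal sketch (my own guess at the root cause, not code from the tutorial) that reproduces the same error with only 2 processes, because the tutorial's 2x2 mesh over devices [0, 1, 2, 3] asks for process groups containing ranks 2 and 3:

```python
# reproduce_new_group_error.py -- hypothetical standalone reproducer
# Launch with: torchrun --nproc_per_node 2 reproduce_new_group_error.py
import torch.distributed as dist


def main():
    # torchrun supplies RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT
    dist.init_process_group(backend="gloo")  # world_size == 2 here

    # A 2x2 logical mesh over physical devices [0, 1, 2, 3] needs groups such
    # as [2, 3]; any rank >= world_size trips the check inside new_group():
    dist.new_group(ranks=[2, 3])
    # RuntimeError: The new group's rank should be within the the world_size
    # set by init_process_group


if __name__ == "__main__":
    main()
```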

captainst · Feb 26 '23 14:02

Sorry @captainst, I believe this example can only run with 4 GPUs. Can you try modifying this line to fit your own case with only 2 GPUs?

JThh · Feb 26 '23 15:02

@JThh Thank you! I tried changing the line to device_mesh = DeviceMesh(physical_mesh_id=torch.tensor([0, 1]), mesh_shape=[2], init_process_group=True) and to device_mesh = DeviceMesh(physical_mesh_id=torch.tensor([0, 1]), mesh_shape=[[2]], init_process_group=True). Neither works. Could you tell me how to modify this line so that the example works on a single node with 2 x 3060 cards?

Thank you again!

captainst · Feb 27 '23 01:02

Can @super-dainiu or @Cypher30 help answer this?

JThh · Feb 27 '23 03:02

cc @YuliangLiu0306

super-dainiu · Feb 27 '23 11:02

@captainst try this one: device_mesh = DeviceMesh(physical_mesh_id=torch.tensor([0, 1]), mesh_shape=[1,2], init_process_group=True)
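
In context, the change around line 28 of auto_parallel_with_resnet.py could look roughly like the sketch below (a suggestion only; it assumes the distributed environment has already been initialized by ColossalAI before the mesh is built, as the log above indicates, and it derives the mesh shape from the world size so the same code still works on 4 GPUs):

```python
import torch
import torch.distributed as dist
from colossalai.device.device_mesh import DeviceMesh  # module path per the traceback

world_size = dist.get_world_size()           # 2 when launched with --nproc_per_node 2
physical_mesh_id = torch.arange(world_size)  # tensor([0, 1])

# A 1 x world_size logical mesh keeps every referenced rank inside the world
# size, so dist.new_group() no longer raises.
device_mesh = DeviceMesh(physical_mesh_id=physical_mesh_id,
                         mesh_shape=[1, world_size],
                         init_process_group=True)
```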

YuliangLiu0306 · Mar 01 '23 03:03

We have made many updates since then. This issue is being closed due to inactivity. Thanks.

binmakeswell · Apr 26 '23 10:04