
[BUG]:

Open superhg opened this issue 2 years ago • 4 comments

Running the Gemini example fails

`When running the Gemini example demo, the following error message occurs: [W socket.cpp:601] [c10d] The client socket has failed to connect to [::ffff:10.19.49.102]:35027 (errno: 110 - Connection timed out). [W socket.cpp:601] [c10d] The client socket has failed to connect to [::ffff:10.19.49.102]:35027 (errno: 110 - Connection timed out). [W socket.cpp:601] [c10d] The client socket has failed to connect to 10.19.49.102:35027 (errno: 110 - Connection timed out). [E socket.cpp:657] [c10d] The client socket has failed to connect to any network address of (bj-zjy-64c768g-049102-24g-3090-002.zxy, 35027). [W socket.cpp:601] [c10d] The client socket has failed to connect to 10.19.49.102:35027 (errno: 110 - Connection timed out). [E socket.cpp:657] [c10d] The client socket has failed to connect to any network address of (bj-zjy-64c768g-049102-24g-3090-002.zxy, 35027). Traceback (most recent call last): File "train_gpt.py", line 114, in Traceback (most recent call last): File "train_gpt.py", line 114, in File "train_gpt.py", line 39, in main else: File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/colossalai/initialize.py", line 219, in launch_from_torch File "train_gpt.py", line 39, in main launch(config=config,else:

File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/colossalai/initialize.py", line 99, in launch File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/colossalai/initialize.py", line 219, in launch_from_torch gpc.init_global_dist(rank, world_size, backend, host, port) File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/colossalai/context/parallel_context.py", line 374, in init_global_dist launch(config=config, File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/colossalai/initialize.py", line 99, in launch dist.init_process_group(rank=rank, world_size=world_size, backend=backend, init_method=init_method) File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group gpc.init_global_dist(rank, world_size, backend, host, port) File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/colossalai/context/parallel_context.py", line 374, in init_global_dist store, rank, world_size = next(rendezvous_iterator) File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 201, in _tcp_rendezvous_handler dist.init_process_group(rank=rank, world_size=world_size, backend=backend, init_method=init_method) File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout) File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store return TCPStore( RuntimeError: The client socket has failed to connect to any network address of 
(bj-zjy-64c768g-049102-24g-3090-002.zxy, 35027). The client socket has failed to connect to 10.19.49.102:35027 (errno: 110 - Connection timed out). store, rank, world_size = next(rendezvous_iterator) File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 201, in _tcp_rendezvous_handler store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout) File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store return TCPStore( RuntimeError: The client socket has failed to connect to any network address of (bj-zjy-64c768g-049102-24g-3090-002.zxy, 35027). The client socket has failed to connect to 10.19.49.102:35027 (errno: 110 - Connection timed out). ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8427) of binary: /home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/bin/python3.8 Traceback (most recent call last): File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/bin/torchrun", line 8, in sys.exit(main()) File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper return f(*args, **kwargs) File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main run(args) File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run elastic_launch( File "/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call return launch_agent(self._config, self._entrypoint, list(args)) File 
"/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_gpt.py FAILED

Failures: [1]: time : 2023-02-22_14:32:50 host : bj-zjy-64c768g-049102-24g-3090-002.zxy rank : 1 (local_rank: 1) exitcode : 1 (pid: 8428) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure): [0]: time : 2023-02-22_14:32:50 host : bj-zjy-64c768g-049102-24g-3090-002.zxy rank : 0 (local_rank: 0) exitcode : 1 (pid: 8427) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Error: failed to run torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_gpt.py --config ./configs/gpt2_small_zero3_pp1d.py --from_torch --use_dummy_dataset on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /home/hegang1/home/data/hegang/gpt/examples/language/gpt/titans && export CUDA_PATH="/mnt/NFS1/tools/cuda-11.1" XDG_SESSION_ID="15558" HOSTNAME="bj-zjy-64c768g-049102-24g-3090-002.zxy" TERM_PROGRAM="vscode" SHELL="/bin/bash" TERM="xterm-256color" HISTSIZE="1000" SSH_CLIENT="10.113.9.168 53666 22" CONDA_SHLVL="2" CONDA_PROMPT_MODIFIER="(mfa-aligner) " TERM_PROGRAM_VERSION="1.70.1" CUDA_HOME="/mnt/NFS1/tools/cuda-11.1" KALDI_DIR="/home/hegang1/home/data/hegang/data/kaldi" USER="hegang1" LD_LIBRARY_PATH="/mnt/NFS1/tools/gcc-7.5.0/lib64:/mnt/NFS1/tools/cuda-11.1/lib64:/mnt/NFS1/tools/cuda-11.1/lib64:/mnt/NFS1/tools/cuda-11.1/lib64::/home/hegang1/home/data/zhoudao/anaconda3/lib" LS_COLORS="rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:.tar=38;5;9:.tgz=38;5;9:.arc=38;5;9:.arj=38;5;9:.taz=38;5;9:.lha=38;5;9:.lz4=38;5;9:.lzh=38;5;9:.lzma=38;5;9:.tlz=38;5;9:.txz=38;5;9:.tzo=38;5;9:.t7z=38;5;9:.zip=38;5;9:.z=38;5;9:.Z=38;5;9:.dz=38;5;9:.gz=38;5;9:.lrz=38;5;9:.lz=38;5;9:.lzo=38;5;9:.xz=38;5;9:.bz2=38;5;9:.bz=38;5;9:.tbz=38;5;9:.tbz2=38;5;9:.tz=38;5;9:.deb=38;5;9:.rpm=38;5;9:.jar=38;5;9:.war=38;5;9:.ear=38;5;9:.sar=38;5;9:.rar=38;5;9:.alz=38;5;9:.ace=38;5;9:.zoo=38;5;9:.cpio=38;5;9:.7z=38;5;9:.rz=38;5;9:.cab=38;5;9:.jpg=38;5;13:.jpeg=38;5;13:.gif=38;5;13:.bmp=38;5;13:.pbm=38;5;13:.pgm=38;5;13:.ppm=38;5;13:.tga=38;5;13:.xbm=38;5;13:.xpm=38;5;13:.tif=38;5;13:.tiff=38;5;13:.png=38;5;13:.svg=38;5;13:.svgz=38;5;13:.mng=38;5;13:.pcx=38;5;13:.mov=38;5;13:.mpg=38;5;13:.mpeg=38;5;13:.m2v=38;5;13:.mkv=38;5;13:.webm=38;5;13:.ogm=38;5;13:.mp4=38;5;13:.m4v=38;5;13:.mp4v=38;5;13:.vob=38;5;13:.qt=38;5;13:.nuv=38;5;13:.wmv=38;5;13:.asf=38;5;13:.rm=38;5;13:.rmvb=38;5;13:.flc=38;5;13:.avi=38;5;13:.fli=38;5;13:.flv=38;5;13:.gl=38;5;13:.dl=38;5;13:.x
cf=38;5;13:.xwd=38;5;13:.yuv=38;5;13:.cgm=38;5;13:.emf=38;5;13:.axv=38;5;13:.anx=38;5;13:.ogv=38;5;13:.ogx=38;5;13:.aac=38;5;45:.au=38;5;45:.flac=38;5;45:.mid=38;5;45:.midi=38;5;45:.mka=38;5;45:.mp3=38;5;45:.mpc=38;5;45:.ogg=38;5;45:.ra=38;5;45:.wav=38;5;45:.axa=38;5;45:.oga=38;5;45:.spx=38;5;45:*.xspf=38;5;45:" CONDA_EXE="/home/hegang1/home/data/zhoudao/anaconda3/bin/conda" DATA="/data/scratch/gpt_data/small-gpt-dataset.json" CONDA_PREFIX_1="/home/hegang1/home/data/zhoudao/anaconda3" MAIL="/var/spool/mail/hegang1" PATH="/mnt/NFS1/tools/gcc-7.5.0/bin:/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/bin:/mnt/NFS1/tools/cuda-11.1/bin:/mnt/NFS1/tools/soundtouch/bin:/mnt/NFS1/tools/sox/bin:/home/hegang1/.vscode-server/bin/6d9b74a70ca9c7733b29f0456fd8195364076dda/bin/remote-cli:/mnt/NFS1/tools/cuda-11.1/bin:/mnt/NFS1/tools/soundtouch/bin:/mnt/NFS1/tools/sox/bin:/home/hegang1/home/data/zhoudao/anaconda3/bin:/home/hegang1/home/data/zhoudao/anaconda3/condabin:/mnt/NFS1/tools/cuda-11.1/bin:/mnt/NFS1/tools/soundtouch/bin:/mnt/NFS1/tools/sox/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/hegang1/.local/bin:/home/hegang1/bin" GSETTINGS_SCHEMA_DIR="/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/share/glib-2.0/schemas" CONDA_PREFIX="/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner" PWD="/home/hegang1/home/data/hegang/gpt/examples/language/gpt/titans" LANG="en_US.UTF-8" HISTCONTROL="ignoredups" SHLVL="6" HOME="/home/hegang1" VSCODE_GIT_ASKPASS_MAIN="/home/hegang1/.vscode-server/bin/6d9b74a70ca9c7733b29f0456fd8195364076dda/extensions/git/dist/askpass-main.js" CONDA_PYTHON_EXE="/home/hegang1/home/data/zhoudao/anaconda3/bin/python" LOGNAME="hegang1" KALDI_ROOT="/home/hegang1/home/data/hegang/data/kaldi" SSH_CONNECTION="10.113.9.168 53666 10.19.102.58 22" VSCODE_GIT_IPC_HANDLE="/run/user/218022/vscode-git-3099238913.sock" VSCODE_IPC_HOOK_CLI="/run/user/218022/vscode-ipc-7e18044c-2ebb-4694-9126-3fcd6aa5a956.sock" 
CONDA_DEFAULT_ENV="mfa-aligner" LESSOPEN="||/usr/bin/lesspipe.sh %s" BROWSER="/home/hegang1/.vscode-server/bin/6d9b74a70ca9c7733b29f0456fd8195364076dda/bin/helpers/browser.sh" GIT_ASKPASS="/home/hegang1/.vscode-server/bin/6d9b74a70ca9c7733b29f0456fd8195364076dda/extensions/git/dist/askpass.sh" VSCODE_GIT_ASKPASS_NODE="/home/hegang1/.vscode-server/bin/6d9b74a70ca9c7733b29f0456fd8195364076dda/node" XDG_RUNTIME_DIR="/run/user/218022" COLORTERM="truecolor" _="/home/hegang1/home/data/zhoudao/anaconda3/envs/mfa-aligner/bin/colossalai" && torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_gpt.py --config ./configs/gpt2_small_zero3_pp1d.py --from_torch --use_dummy_dataset'

Exit code: 1

Stdout: already printed

Stderr: already printed

====== Training on All Nodes ===== 127.0.0.1: failure

====== Stopping All Nodes ===== 127.0.0.1: finish`

Environment

CUDA: 11.1
Python: 3.8
PyTorch: 1.13.1
ColossalAI: 0.2.5
nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi
nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
nvidia-cudnn-cu11 8.5.0.96

NCCL: none

superhg, Feb 22 '23 07:02

Bot detected the issue body's language is not English, translate it automatically.

Title: [BUG]:

Issues-translate-bot, Feb 22 '23 07:02

I have fixed this issue

joan126, Feb 23 '23 06:02

> I have fixed this issue

Hi @joan126, could you please share the details of how you fixed it? Contributions from the open source community are very welcome and appreciated! Thanks.

binmakeswell, Feb 28 '23 08:02

Hi @joan126, could you please share the details of how you fixed it? Thanks!

Vvvvvvsysy, Mar 02 '23 04:03

We have updated a lot. This issue was closed due to inactivity. Thanks.

binmakeswell, Apr 26 '23 10:04
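
The thread never records the actual fix, so the following is only a diagnostic sketch, not a confirmed solution. In the log above, both ranks try to reach the node through its own hostname (bj-zjy-64c768g-049102-24g-3090-002.zxy, resolved to 10.19.49.102) rather than through 127.0.0.1, and the dumped SSH_CONNECTION value lists 10.19.102.58 as the machine's address. That might point to a hostname that resolves to an address the machine cannot reach itself on, which is an assumption, not something the thread confirms. The helper names below (`resolve_ipv4`, `can_connect`, `reachable_from_self`) are hypothetical and use only the Python standard library; the script opens a throwaway listener and checks whether it can be dialed through each address the hostname resolves to.

```python
import socket
from typing import List


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the distinct IPv4 addresses that hostname resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, unreachable hosts
        return False


def reachable_from_self(address: str, timeout: float = 3.0) -> bool:
    """Listen on a throwaway port on all IPv4 interfaces, then dial it via address."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("", 0))  # port 0 lets the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        return can_connect(address, port, timeout)


if __name__ == "__main__":
    name = socket.getfqdn()
    try:
        addresses = resolve_ipv4(name)
    except socket.gaierror as exc:
        print(f"{name} does not resolve: {exc}")
        addresses = []
    print(f"hostname: {name}")
    # 127.0.0.1 is checked first as a baseline that should always work.
    for address in ["127.0.0.1"] + addresses:
        status = "reachable" if reachable_from_self(address) else "NOT reachable"
        print(f"  {address}: {status}")
```

If 127.0.0.1 is reachable but an address derived from the hostname is not, the name resolution for the node (for example its DNS or hosts-file entry) or a local firewall would be the first things to inspect before looking at ColossalAI itself.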