training issue

Open. MaleekaA opened this issue 1 year ago • 1 comment

If I run any of the examples or training setups from the applications in ColossalAI, I get this issue. Can you help me solve it?

RuntimeError: Stop_waiting response is expected
Error: failed to run torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr= --master_port= vit_train_demo.py --model_name_or_path google/vit-base-patch16-224 --output_path ./home/jovyan/malika/ColossalAI/examples/images/vit --plugin hybrid_parallel --batch_size 8 --tp_size 4 --pp_size 1 --num_epoch 3 --learning_rate 2e-4 --weight_decay 0.05 --warmup_ratio 0.3 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /home/jovyan/malika/ColossalAI/examples/images/vit && export SHELL="/bin/bash" COLORTERM="truecolor" TERM_PROGRAM_VERSION="1.90.2" LC_ADDRESS="ko_KR.UTF-8" LC_NAME="ko_KR.UTF-8" LC_MONETARY="ko_KR.UTF-8" PWD="/home/jovyan/malika/ColossalAI/examples/images/vit" LOGNAME="jovyan" NCCL_DEBUG="INFO" VSCODE_GIT_ASKPASS_NODE="/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/node" MOTD_SHOWN="pam" HOME="/home/jovyan" LANG="ko_KR.UTF-8" LC_PAPER="ko_KR.UTF-8" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:" VIRTUAL_ENV="/home/jovyan/.venv/torch2.3.0-py3.10-cuda11.8" 
SSL_CERT_DIR="/usr/lib/ssl/certs" GIT_ASKPASS="/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/extensions/git/dist/askpass.sh" SSH_CONNECTION="10.0.0.137 60450 10.0.31.75 22" CUDA_VISIBLE_DEVICES="0,1,2,3" LC_IDENTIFICATION="ko_KR.UTF-8" TERM="xterm-256color" USER="jovyan" VISIBLE="now" VSCODE_GIT_IPC_HANDLE="/tmp/vscode-git-044e287697.sock" SHLVL="2" LC_TELEPHONE="ko_KR.UTF-8" LC_MESSAGES="ko_KR.UTF-8" LC_MEASUREMENT="ko_KR.UTF-8" VIRTUAL_ENV_PROMPT="(torch2.3.0-py3.10-cuda11.8) " LD_LIBRARY_PATH="/usr/lib/nvidia:/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64" LC_CTYPE="ko_KR.UTF-8" SSL_CERT_FILE="/usr/lib/ssl/certs/ca-certificates.crt" SSH_CLIENT="10.0.0.137 60450 22" LC_TIME="ko_KR.UTF-8" OMP_NUM_THREADS="1" VSCODE_GIT_ASKPASS_MAIN="/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/extensions/git/dist/askpass-main.js" CUDA_HOME="/usr/local/cuda" LC_COLLATE="ko_KR.UTF-8" GCC_COLORS="error=01;31:warning=01;35:note=01;36:caret=01;32:locus=01:quote=01" BROWSER="/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/bin/helpers/browser.sh" PATH="/usr/local/cuda/bin:/home/jovyan/.venv/torch2.3.0-py3.10-cuda11.8/bin:/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/bin/remote-cli:/usr/local/cuda/bin:/home/jovyan/.venv/torch2.3.0-py3.10-cuda11.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin" LC_NUMERIC="ko_KR.UTF-8" OLDPWD="/home/jovyan/malika/ColossalAI/examples/images" TERM_PROGRAM="vscode" VSCODE_IPC_HOOK_CLI="/tmp/vscode-ipc-3dbaed2b-9e4b-4150-855d-699020003867.sock" _="/home/jovyan/.venv/torch2.3.0-py3.10-cuda11.8/bin/colossalai" && torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr=210.125.69.5 --master_port=32309 vit_train_demo.py --model_name_or_path google/vit-base-patch16-224 
--output_path ./home/jovyan/malika/ColossalAI/examples/images/vit --plugin hybrid_parallel --batch_size 8 --tp_size 4 --pp_size 1 --num_epoch 3 --learning_rate 2e-4 --weight_decay 0.05 --warmup_ratio 0.3'

Exit code: 1

Stdout: already printed

Stderr: already printed

====== Training on All Nodes ===== 127.0.0.1: failure

====== Stopping All Nodes ===== 127.0.0.1: finish

MaleekaA, Jul 05 '24 07:07

RuntimeError: Stop_waiting response is expected indicates that the problem is on PyTorch's side rather than in ColossalAI. Please make sure your environment is set up correctly (PyTorch version, CUDA) and re-run.

Edenzzzz, Jul 09 '24 08:07
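
The environment check suggested in the reply can be sketched roughly as follows. This is a minimal sketch and not part of ColossalAI or the original thread: the helper name collect_env is hypothetical, and the script only reports what the active interpreter sees, so it should be run inside the same virtual environment used for training (the log shows one named torch2.3.0-py3.10-cuda11.8, which the reported versions can be compared against).

```python
import importlib.util
import platform


def collect_env():
    """Return a dict describing the local PyTorch/CUDA setup.

    Hypothetical helper for illustration; it only reads version info.
    """
    info = {"python": platform.python_version(), "torch": None}
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this interpreter.
        return info

    import torch
    import torch.distributed as dist

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None on CPU-only builds).
    info["torch_cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    # NCCL is the backend normally used for multi-GPU training.
    info["nccl_available"] = dist.is_available() and dist.is_nccl_available()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

If gpu_count is lower than the --nproc_per_node value passed to torchrun (4 in the log above), or cuda_available is False, the environment is worth fixing before re-running the example.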