ColossalAI
[BUG]: /bin/bash: line 0: export: `=/usr/bin/supervisord': not a valid identifier Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py --use_trainer on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
🐛 Describe the bug
root@autodl-container-8450119b52-890be3f8:~# colossalai run --nproc_per_node 1 train.py --use_trainer
/bin/bash: line 0: export: `=/usr/bin/supervisord': not a valid identifier
Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py --use_trainer on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
Command: 'cd /root && export ="/usr/bin/supervisord" SHELL="/bin/bash" NV_LIBCUBLAS_VERSION="11.4.2.10064-1" NVIDIA_VISIBLE_DEVICES="GPU-f4c5eaa0-3871-0885-09b9-c73f33363172" NV_NVML_DEV_VERSION="11.3.58-1" NV_CUDNN_PACKAGE_NAME="libcudnn8" NV_LIBNCCL_DEV_PACKAGE="libnccl-dev=2.9.6-1+cuda11.3" NV_LIBNCCL_DEV_PACKAGE_VERSION="2.9.6-1" HOSTNAME="autodl-container-8450119b52-890be3f8" LANGUAGE="en_US:en" NVIDIA_REQUIRE_CUDA="cuda>=11.3 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 driver>=450" NV_LIBCUBLAS_DEV_PACKAGE="libcublas-dev-11-3=11.4.2.10064-1" NV_NVTX_VERSION="11.3.58-1" NV_ML_REPO_ENABLED="1" NV_CUDA_CUDART_DEV_VERSION="11.3.58-1" NV_LIBCUSPARSE_VERSION="11.5.0.58-1" NV_LIBNPP_VERSION="11.3.3.44-1" NCCL_VERSION="2.9.6-1" PWD="/root" AutoDLContainerUUID="8450119b52-890be3f8" NV_CUDNN_PACKAGE="libcudnn8=8.2.0.53-1+cuda11.3" NVIDIA_DRIVER_CAPABILITIES="compute,utility,graphics,video" JUPYTER_SERVER_URL="http://autodl-container-8450119b52-890be3f8:8888/jupyter/" NV_LIBNPP_PACKAGE="libnpp-11-3=11.3.3.44-1" NV_LIBNCCL_DEV_PACKAGE_NAME="libnccl-dev" TZ="Asia/Shanghai" NV_LIBCUBLAS_DEV_VERSION="11.4.2.10064-1" NV_LIBCUBLAS_DEV_PACKAGE_NAME="libcublas-dev-11-3" LINES="43" NV_CUDA_CUDART_VERSION="11.3.58-1" HOME="/root" LANG="en_US.UTF-8" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:" COLUMNS="224" AutoDLRegion="beijing-B" CUDA_VERSION="11.3.0" AgentHost="10.0.0.123" NV_LIBCUBLAS_PACKAGE="libcublas-11-3=11.4.2.10064-1" PYDEVD_USE_FRAME_EVAL="NO" NV_LIBNPP_DEV_PACKAGE="libnpp-dev-11-3=11.3.3.44-1" NV_LIBCUBLAS_PACKAGE_NAME="libcublas-11-3" NV_LIBNPP_DEV_VERSION="11.3.3.44-1" JUPYTER_SERVER_ROOT="/root" TERM="xterm-256color" NV_ML_REPO_URL="https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64" NV_LIBCUSPARSE_DEV_VERSION="11.5.0.58-1" LIBRARY_PATH="/usr/local/cuda/lib64/stubs" NV_CUDNN_VERSION="8.2.0.53" AutodlAutoPanelToken="jupyter-autodl-container-8450119b52-890be3f8-2c8e4ae4c664e48f6bf3be65393db755272ea24bb39a346efa289508f2bb50031" SHLVL="2" PYXTERM_DIMENSIONS="80x25" NV_CUDA_LIB_VERSION="11.3.0-1" NVARCH="x86_64" NV_CUDNN_PACKAGE_DEV="libcudnn8-dev=8.2.0.53-1+cuda11.3" 
NV_CUDA_COMPAT_PACKAGE="cuda-compat-11-3" NV_LIBNCCL_PACKAGE="libnccl2=2.9.6-1+cuda11.3" LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64" LC_CTYPE="C.UTF-8" OMP_NUM_THREADS="12" PATH="/root/miniconda3/bin:/usr/local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" NV_LIBNCCL_PACKAGE_NAME="libnccl2" NV_LIBNCCL_PACKAGE_VERSION="2.9.6-1" MKL_NUM_THREADS="12" DEBIAN_FRONTEND="noninteractive" _="/root/miniconda3/bin/colossalai" && torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py --use_trainer'
Exit code: 1
Stdout: already printed
Stderr: already printed
====== Training on All Nodes ===== 127.0.0.1: failure
====== Stopping All Nodes ===== 127.0.0.1: finish
Environment
PyTorch 1.11.0, RTX 2080 Ti, CUDA 11.3
I ran this on a paid cloud platform, using the ResNet example as a case study.
I think it is due to a mismatch between your NVIDIA runtime (ubuntu) and your OS environment (debian). You might want to change either of them to match.
I am able to run any PyTorch program normally, except for the example provided by Colossal-AI.
The reason is that `/usr/bin/supervisord` is taken as the name of the variable rather than a value to export. It happens in the line `cd /root && export ="/usr/bin/supervisord"`: there is a stray empty variable name right after `export`, which is what needs to be removed.
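For context, here is a minimal sketch of how such a malformed `export` can arise. Assuming the launcher builds the remote command by concatenating every entry of the current environment into `export KEY="VALUE" ...` (the helper name below is hypothetical), an environment entry with an empty name, which supervisord-managed containers can produce, yields exactly the `export ="/usr/bin/supervisord"` seen above; skipping invalid names avoids it:

```python
import re

# Valid POSIX shell identifiers: letters, digits and underscores, not starting with a digit.
_VALID_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def build_export_cmd(env):
    """Hypothetical helper: turn an environment mapping into one `export ...` string,
    skipping entries whose names bash would reject as invalid identifiers."""
    parts = [f'{k}="{v}"' for k, v in env.items() if _VALID_NAME.match(k)]
    return ("export " + " ".join(parts)) if parts else ""

# An environment entry with an empty name reproduces the failure from this issue:
# without the filter, the command would begin with `export ="/usr/bin/supervisord"`,
# which bash rejects with "not a valid identifier".
bad_env = {"": "/usr/bin/supervisord", "SHELL": "/bin/bash"}
print(build_export_cmd(bad_env))  # -> export SHELL="/bin/bash"
```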
I think I see the problem now. But I have a question: is there a problem with my local script file, or with the script file that comes with Colossal-AI?
Can you try the `torchrun` command directly and see if the error persists?
Are you talking about this command: `python -m torch.distributed.launch --nproc_per_node=1`? I have also tried using this command, and there was no error.
There was no error, but did it hang? Or did it run normally?
It runs normally and I can see the training feedback. But I'm not sure whether the speed is okay.
If it ran okay, the speed will not be compromised, given that you used `torchrun`.
Okay, so the problem was solved in another way. I can only start the program with `python -m torch.distributed.launch --nproc_per_node=1 train.py`, and cannot start it with `colossalai run --nproc_per_node 1 train.py --use_trainer`, right? These two commands will not actually make any difference, will they?
No, they will not make a difference in performance.
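For what it's worth, the two launch paths end up in the same place: both populate the standard torch.distributed environment variables and spawn the same worker processes, so the training script itself is unchanged. A minimal sketch, assuming the usual `colossalai.launch_from_torch` entry point (the exact signature may differ across versions):

```python
# train.py (sketch): the same script runs under either launcher, because both
# `torchrun` / `python -m torch.distributed.launch` and `colossalai run` populate the
# torch.distributed environment variables (RANK, WORLD_SIZE, LOCAL_RANK,
# MASTER_ADDR, MASTER_PORT) before the worker process starts.
import colossalai

def main():
    # Reads the distributed settings from the environment set up by the launcher;
    # older releases require a `config` argument, so check your installed version.
    colossalai.launch_from_torch(config={})
    # ... build the model, dataloaders, and trainer exactly as in the ResNet example ...

if __name__ == "__main__":
    main()
```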