
[BUG]: /bin/bash: line 0: export: `=/usr/bin/supervisord': not a valid identifier Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py --use_trainer on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!

Open · tang-ed opened this issue 1 year ago

🐛 Describe the bug

root@autodl-container-8450119b52-890be3f8:~# colossalai run --nproc_per_node 1 train.py --use_trainer /bin/bash: line 0: export: `=/usr/bin/supervisord': not a valid identifier Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py --use_trainer on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /root && export ="/usr/bin/supervisord" SHELL="/bin/bash" NV_LIBCUBLAS_VERSION="11.4.2.10064-1" NVIDIA_VISIBLE_DEVICES="GPU-f4c5eaa0-3871-0885-09b9-c73f33363172" NV_NVML_DEV_VERSION="11.3.58-1" NV_CUDNN_PACKAGE_NAME="libcudnn8" NV_LIBNCCL_DEV_PACKAGE="libnccl-dev=2.9.6-1+cuda11.3" NV_LIBNCCL_DEV_PACKAGE_VERSION="2.9.6-1" HOSTNAME="autodl-container-8450119b52-890be3f8" LANGUAGE="en_US:en" NVIDIA_REQUIRE_CUDA="cuda>=11.3 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 driver>=450" NV_LIBCUBLAS_DEV_PACKAGE="libcublas-dev-11-3=11.4.2.10064-1" NV_NVTX_VERSION="11.3.58-1" NV_ML_REPO_ENABLED="1" NV_CUDA_CUDART_DEV_VERSION="11.3.58-1" NV_LIBCUSPARSE_VERSION="11.5.0.58-1" NV_LIBNPP_VERSION="11.3.3.44-1" NCCL_VERSION="2.9.6-1" PWD="/root" AutoDLContainerUUID="8450119b52-890be3f8" NV_CUDNN_PACKAGE="libcudnn8=8.2.0.53-1+cuda11.3" NVIDIA_DRIVER_CAPABILITIES="compute,utility,graphics,video" JUPYTER_SERVER_URL="http://autodl-container-8450119b52-890be3f8:8888/jupyter/" NV_LIBNPP_PACKAGE="libnpp-11-3=11.3.3.44-1" NV_LIBNCCL_DEV_PACKAGE_NAME="libnccl-dev" TZ="Asia/Shanghai" NV_LIBCUBLAS_DEV_VERSION="11.4.2.10064-1" NV_LIBCUBLAS_DEV_PACKAGE_NAME="libcublas-dev-11-3" LINES="43" NV_CUDA_CUDART_VERSION="11.3.58-1" HOME="/root" LANG="en_US.UTF-8" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:" COLUMNS="224" AutoDLRegion="beijing-B" CUDA_VERSION="11.3.0" AgentHost="10.0.0.123" NV_LIBCUBLAS_PACKAGE="libcublas-11-3=11.4.2.10064-1" PYDEVD_USE_FRAME_EVAL="NO" NV_LIBNPP_DEV_PACKAGE="libnpp-dev-11-3=11.3.3.44-1" NV_LIBCUBLAS_PACKAGE_NAME="libcublas-11-3" NV_LIBNPP_DEV_VERSION="11.3.3.44-1" JUPYTER_SERVER_ROOT="/root" TERM="xterm-256color" NV_ML_REPO_URL="https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64" NV_LIBCUSPARSE_DEV_VERSION="11.5.0.58-1" LIBRARY_PATH="/usr/local/cuda/lib64/stubs" NV_CUDNN_VERSION="8.2.0.53" AutodlAutoPanelToken="jupyter-autodl-container-8450119b52-890be3f8-2c8e4ae4c664e48f6bf3be65393db755272ea24bb39a346efa289508f2bb50031" SHLVL="2" PYXTERM_DIMENSIONS="80x25" NV_CUDA_LIB_VERSION="11.3.0-1" NVARCH="x86_64" NV_CUDNN_PACKAGE_DEV="libcudnn8-dev=8.2.0.53-1+cuda11.3" 
NV_CUDA_COMPAT_PACKAGE="cuda-compat-11-3" NV_LIBNCCL_PACKAGE="libnccl2=2.9.6-1+cuda11.3" LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64" LC_CTYPE="C.UTF-8" OMP_NUM_THREADS="12" PATH="/root/miniconda3/bin:/usr/local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" NV_LIBNCCL_PACKAGE_NAME="libnccl2" NV_LIBNCCL_PACKAGE_VERSION="2.9.6-1" MKL_NUM_THREADS="12" DEBIAN_FRONTEND="noninteractive" _="/root/miniconda3/bin/colossalai" && torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py --use_trainer'

Exit code: 1

Stdout: already printed

Stderr: already printed

====== Training on All Nodes ===== 127.0.0.1: failure

====== Stopping All Nodes ===== 127.0.0.1: finish

Environment

PyTorch 1.11.0, RTX 2080 Ti, CUDA 11.3

tang-ed avatar Apr 25 '23 08:04 tang-ed

This is what I ran on a paid cloud platform (AutoDL), using the ResNet example as a case study.

tang-ed avatar Apr 25 '23 08:04 tang-ed

I think it is due to a mismatch between your NVIDIA runtime (Ubuntu) and your OS environment (Debian). You might want to change one of them so that they match.

JThh avatar Apr 26 '23 05:04 JThh

I am able to run any PyTorch program normally, except for the example provided by ColossalAI.

tang-ed avatar Apr 26 '23 06:04 tang-ed

The reason is that /usr/bin/supervisord is being exported without a variable name. The generated command contains cd /root && export ="/usr/bin/supervisord" ..., so bash reads =/usr/bin/supervisord as the identifier to export, which is not valid. In other words, one of the environment entries on your node has an empty name; you should remove that stray entry (the bare = right after export) before launching.

JThh avatar Apr 26 '23 07:04 JThh
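For reference, a quick way to spot the offending entry on the node (a minimal diagnostic sketch, not from the thread; it assumes the bad entry is visible in the output of env):

```bash
# Print environment entries whose name part is not a valid shell identifier.
# An entry like `=/usr/bin/supervisord` (empty name) becomes
# `export ="/usr/bin/supervisord"` in the command generated by colossalai run,
# which bash rejects with "not a valid identifier".
# Note: multi-line values may show up as harmless false positives.
env | grep -Ev '^[A-Za-z_][A-Za-z0-9_]*='
```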

I think I understand the problem now. But I have a question: is the issue with my local script file, or with the script file that comes with ColossalAI?

tang-ed avatar Apr 26 '23 08:04 tang-ed

Can you try the torchrun command directly and see if the error persists?

JThh avatar Apr 26 '23 09:04 JThh

Are you talking about this command: python -m torch.distributed.launch --nproc_per_node=1? I have also tried that command, and there was no error.

tang-ed avatar Apr 26 '23 09:04 tang-ed
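For comparison, here are the two launch paths being discussed. The torchrun line is copied from the error message at the top of this issue (it is what colossalai run generates internally); the torch.distributed.launch line mirrors the command tang-ed reports running, with train.py assumed as the script:

```bash
# Generated internally by: colossalai run --nproc_per_node 1 train.py --use_trainer
torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 \
    --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 \
    --rdzv_id=colossalai-default-job train.py --use_trainer

# The older-style launcher that worked without the export error
python -m torch.distributed.launch --nproc_per_node=1 train.py
```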

There was no error, but did it hang? Or did it run normally?

JThh avatar Apr 26 '23 10:04 JThh

It runs normally and I can see the training progress. But I'm not sure whether the speed is okay.

tang-ed avatar Apr 26 '23 10:04 tang-ed

If it ran okay, the speed would not be compromised given you used torchrun.

JThh avatar Apr 26 '23 11:04 JThh

Okay, so you could say the problem was solved in another way. I can only start the program with python -m torch.distributed.launch --nproc_per_node=1 train.py, and I cannot start it with colossalai run --nproc_per_node 1 train.py --use_trainer, right? These two commands won't actually make any real difference, will they?

tang-ed avatar Apr 26 '23 11:04 tang-ed

No, they will not make a difference in performance.

JThh avatar Apr 26 '23 12:04 JThh
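Not from the thread, but one way to see why the two launchers are interchangeable: both hand the worker process essentially the same torch.distributed environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), which are what the training script actually consumes. A quick check, assuming your torchrun build supports --no_python:

```bash
# Print the distributed environment variables a single worker receives.
# Essentially the same variables are set whether the job is started via
# torchrun, torch.distributed.launch, or colossalai run (which wraps torchrun).
torchrun --nproc_per_node=1 --nnodes=1 --no_python \
    bash -c 'env | grep -E "^(RANK|LOCAL_RANK|WORLD_SIZE|MASTER_ADDR|MASTER_PORT)="'
```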