[BUG]: ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'
🐛 Describe the bug
I got the following errors when running the ResNet example.
```
(colossal-AI) [root@node64 resnet]# colossalai run --nproc_per_node 1 train.py -c ./ckpt-fp32
[07/25/23 20:27:25] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
                    INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[07/25/23 20:27:28] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/context/parallel_context.py:558 set_seed
                    INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
                    INFO colossalai - colossalai - INFO: /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/initialize.py:115 launch
                    INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/booster/booster.py:69: UserWarning: The plugin will control the accelerator, so the device argument will be ignored.
warnings.warn('The plugin will control the accelerator, so the device argument will be ignored.')
Files already downloaded and verified
Files already downloaded and verified
/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py:329: UserWarning:

                               !! WARNING !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++ 4.8.5) may be ABI-incompatible with PyTorch! Please use a compiler that is ABI-compatible with GCC 5.0 and above. See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.
See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6 for instructions on how to install GCC 5 or higher.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                               !! WARNING !!

  warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
Traceback (most recent call last):
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/op_builder/builder.py", line 161, in load
    op_module = self.import_op()
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/op_builder/builder.py", line 110, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/z00621429/ColossalAI/examples/images/resnet/train.py", line 204, in <module>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 18372) of binary: /root/anaconda3/envs/colossal-AI/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/colossal-AI/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2023-07-25_20:27:34
  host       : node64
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 18372)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py -c ./ckpt-fp32 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
Command: 'cd /home/z00621429/ColossalAI/examples/images/resnet && export MANPATH="/usr/share/lmod/lmod/share/man:/usr/local/Modules/share/man:" XDG_SESSION_ID="4892" HOSTNAME="node64" MODULES_CMD="/usr/local/Modules/libexec/modulecmd.tcl" TERM="xterm" SHELL="/bin/bash" LMOD_ROOT="/usr/share/lmod" HISTSIZE="1000" MODULEPATH_ROOT="/usr/share/modulefiles" SSH_CLIENT="90.253.30.126 56175 22" CONDA_SHLVL="1" CONDA_PROMPT_MODIFIER="(colossal-AI) " LMOD_PKG="/usr/share/lmod/lmod" OLDPWD="/root" LMOD_VERSION="8.2.7" SSH_TTY="/dev/pts/0" http_proxy="http://ptaishanpublic2:[email protected]:8080" USER="root" LMOD_sys="Linux" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.jpg=01;35:.jpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.axv=01;35:.anx=01;35:.ogv=01;35:.ogx=01;35:.aac=01;36:.au=01;36:.flac=01;36:.mid=01;36:.midi=01;36:.mka=01;36:.mp3=01;36:.mpc=01;36:.ogg=01;36:.ra=01;36:.wav=01;36:.axa=01;36:.oga=01;36:.spx=01;36:*.xspf=01;36:" LD_LIBRARY_PATH="/opt/ompi/lib:/opt/openblas/lib:/opt/math/lib:/opt/vasp/fftw/lib:/opt/vasp/scaLapack:/hpcshare/l00645749/libs/compiler/gcc/9.3.0/lib64:/hpcshare/l00645749/cuda-11.0/lib64" CONDA_EXE="/root/anaconda3/bin/conda" ENV="/usr/local/Modules/init/profile.sh" MAIL="/var/spool/mail/root" PATH="/root/anaconda3/envs/colossal-AI/bin:/root/anaconda3/condabin:/usr/local/Modules/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/opt/ompi/bin:/opt/ucx/bin:/opt/math/bin:/root/bin" CONDA_PREFIX="/root/anaconda3/envs/colossal-AI" PWD="/home/z00621429/ColossalAI/examples/images/resnet" LANG="en_US.UTF-8" MODULEPATH="/etc/modulefiles:/usr/share/modulefiles:/usr/share/modulefiles/Linux:/usr/share/modulefiles/Core:/usr/share/lmod/lmod/modulefiles/Core" LMOD_CMD="/usr/share/lmod/lmod/libexec/lmod" https_proxy="http://ptaishanpublic2:[email protected]:8080" HISTCONTROL="ignoredups" SHLVL="1" HOME="/root" no_proxy="127.0.0.1,.huawei.com,localhost,local,.local" BASH_ENV="/usr/share/lmod/lmod/init/bash" CONDA_PYTHON_EXE="/root/anaconda3/bin/python" LOGNAME="root" XDG_DATA_DIRS="/root/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share" SSH_CONNECTION="90.253.30.126 56175 90.91.33.64 22" MODULESHOME="/usr/share/lmod/lmod" CONDA_DEFAULT_ENV="colossal-AI" LMOD_SETTARG_FULL_SUPPORT="no" LESSOPEN="||/usr/bin/lesspipe.sh %s" XDG_RUNTIME_DIR="/run/user/0" DISPLAY="localhost:10.0" LMOD_DIR="/usr/share/lmod/lmod/libexec" _="/root/anaconda3/envs/colossal-AI/bin/colossalai" && torchrun --nproc_per_node=1 --nnodes=1 
--node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py -c ./ckpt-fp32'
Exit code: 1
Stdout: already printed
Stderr: already printed
====== Training on All Nodes ======
127.0.0.1: failure
====== Stopping All Nodes ======
127.0.0.1: finish
```
Environment
- Python 3.9
- PyTorch 1.11
- CUDA 11.3.r11.3
- gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
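Note the ABI warning in the log: the JIT build picked up `c++ 4.8.5` even though gcc 7.3.1 is reported here, because `torch.utils.cpp_extension` falls back to plain `c++` when `CXX` is unset. A minimal sketch of pointing the build at the newer compiler (the devtoolset path is an assumption; substitute wherever gcc 7.3.1 lives on your system):

```bash
# Check which compiler the JIT build will pick up, then override it.
c++ --version    # this is what the warning above reports as 4.8.5
gcc --version    # the reported environment has gcc 7.3.1
# Paths below are assumptions (Red Hat devtoolset layout); adjust as needed.
export CC=/opt/rh/devtoolset-7/root/usr/bin/gcc
export CXX=/opt/rh/devtoolset-7/root/usr/bin/g++
colossalai run --nproc_per_node 1 train.py -c ./ckpt-fp32
```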
How about installing colossalai with `CUDA_EXT=1 pip install colossalai`?
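Spelled out, the suggestion above amounts to reinstalling so the CUDA kernels are prebuilt at install time rather than JIT-compiled at runtime; a minimal sketch (the uninstall step is a precaution, not part of the original suggestion):

```bash
# Reinstall colossalai with prebuilt CUDA kernels. Requires nvcc on PATH
# and a GCC >= 5 toolchain, per the ABI warning in the log above.
pip uninstall -y colossalai
CUDA_EXT=1 pip install colossalai --no-cache-dir
```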
Hi, have you found a solution to this problem? I am encountering the same problem with colossalai 0.3.2, torch 2.2.0.dev+cu121, CUDA 12.2.
Hi, ColossalAI does not support Torch 2.0 and above. Torch 1.13.1 is recommended.
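A minimal sketch of an install matching that recommendation; the cu117 wheel index is an assumption, so pick the index that matches your local CUDA:

```bash
# Pin PyTorch to the recommended 1.13.1, then install colossalai so its
# kernels are built against this exact torch version.
pip install torch==1.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
CUDA_EXT=1 pip install colossalai
```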
Thanks! I heard Colossal-AI was tested on H800. What environment (CUDA, torch) was used?
Has anybody found a solution to this? This is part of my trace, and there is enough disk space. Please help. The environment is a Singularity container with nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 as the base image, Python 3.11, and torch 2.4.0. ColossalAI was built from source without BUILD_EXT=1, so the kernels were being compiled at runtime.
```
0: [default2]:[rank2]:     op_kernel = load(
0: [default1]:[rank1]:   File "/usr/local/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2141, in _import_module_from_library
0: [default2]:[rank2]:                 ^^^^^
0: [default1]:[rank1]:     module = importlib.util.module_from_spec(spec)
0: [default2]:[rank2]:   File "/usr/local/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1312, in load
0: [default1]:[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0: [default2]:[rank2]:     return _jit_compile(
0: [default1]:[rank1]:   File "
```
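Since the failure here happens during runtime JIT compilation, one hedged workaround is to prebuild the kernels at install time instead. A sketch of a from-source build with the `BUILD_EXT=1` flag mentioned above, run inside the same container so the compiled extensions match the runtime toolchain:

```bash
# Build ColossalAI from source with the C++/CUDA extensions compiled at
# install time, so nothing needs to be JIT-compiled when training starts.
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
BUILD_EXT=1 pip install .
```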