
[BUG]: ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

Open Aaricis opened this issue 2 years ago • 5 comments

🐛 Describe the bug

I got some errors when running resnet.

```
(colossal-AI) [root@node64 resnet]# colossalai run --nproc_per_node 1 train.py -c ./ckpt-fp32
[07/25/23 20:27:25] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[07/25/23 20:27:28] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/context/parallel_context.py:558 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/initialize.py:115 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/booster/booster.py:69: UserWarning: The plugin will control the accelerator, so the device argument will be ignored.
  warnings.warn('The plugin will control the accelerator, so the device argument will be ignored.')
Files already downloaded and verified
Files already downloaded and verified
/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py:329: UserWarning:
```

```
                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++ 4.8.5) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 5.0 and above.
See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.

See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
for instructions on how to install GCC 5 or higher.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                               !! WARNING !!

  warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
Traceback (most recent call last):
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/op_builder/builder.py", line 161, in load
    op_module = self.import_op()
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/op_builder/builder.py", line 110, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/z00621429/ColossalAI/examples/images/resnet/train.py", line 204, in <module>
    main()
  File "/home/z00621429/ColossalAI/examples/images/resnet/train.py", line 163, in main
    optimizer = HybridAdam(model.parameters(), lr=LEARNING_RATE)
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
    cpu_optim = CPUAdamBuilder().load()
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/op_builder/builder.py", line 187, in load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1144, in load
    return _jit_compile(
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1357, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1469, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam':
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/colossal-AI/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -std=c++14 -lcudart -lcublas -g -Wno-reorder -fopenmp -march=native -c /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/colossal-AI/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -std=c++14 -lcudart -lcublas -g -Wno-reorder -fopenmp -march=native -c /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.cpp -o cpu_adam.o
c++: error: unrecognized command line option ‘-std=c++14’
c++: error: unrecognized command line option ‘-std=c++14’
ninja: build stopped: subcommand failed.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 18372) of binary: /root/anaconda3/envs/colossal-AI/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/colossal-AI/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

```
train.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2023-07-25_20:27:34
  host       : node64
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 18372)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py -c ./ckpt-fp32 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
```

Command: 'cd /home/z00621429/ColossalAI/examples/images/resnet && export MANPATH="/usr/share/lmod/lmod/share/man:/usr/local/Modules/share/man:" XDG_SESSION_ID="4892" HOSTNAME="node64" MODULES_CMD="/usr/local/Modules/libexec/modulecmd.tcl" TERM="xterm" SHELL="/bin/bash" LMOD_ROOT="/usr/share/lmod" HISTSIZE="1000" MODULEPATH_ROOT="/usr/share/modulefiles" SSH_CLIENT="90.253.30.126 56175 22" CONDA_SHLVL="1" CONDA_PROMPT_MODIFIER="(colossal-AI) " LMOD_PKG="/usr/share/lmod/lmod" OLDPWD="/root" LMOD_VERSION="8.2.7" SSH_TTY="/dev/pts/0" http_proxy="http://ptaishanpublic2:[email protected]:8080" USER="root" LMOD_sys="Linux" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.jpg=01;35:.jpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.axv=01;35:.anx=01;35:.ogv=01;35:.ogx=01;35:.aac=01;36:.au=01;36:.flac=01;36:.mid=01;36:.midi=01;36:.mka=01;36:.mp3=01;36:.mpc=01;36:.ogg=01;36:.ra=01;36:.wav=01;36:.axa=01;36:.oga=01;36:.spx=01;36:*.xspf=01;36:" 
LD_LIBRARY_PATH="/opt/ompi/lib:/opt/openblas/lib:/opt/math/lib:/opt/vasp/fftw/lib:/opt/vasp/scaLapack:/hpcshare/l00645749/libs/compiler/gcc/9.3.0/lib64:/hpcshare/l00645749/cuda-11.0/lib64" CONDA_EXE="/root/anaconda3/bin/conda" ENV="/usr/local/Modules/init/profile.sh" MAIL="/var/spool/mail/root" PATH="/root/anaconda3/envs/colossal-AI/bin:/root/anaconda3/condabin:/usr/local/Modules/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/opt/ompi/bin:/opt/ucx/bin:/opt/math/bin:/root/bin" CONDA_PREFIX="/root/anaconda3/envs/colossal-AI" PWD="/home/z00621429/ColossalAI/examples/images/resnet" LANG="en_US.UTF-8" MODULEPATH="/etc/modulefiles:/usr/share/modulefiles:/usr/share/modulefiles/Linux:/usr/share/modulefiles/Core:/usr/share/lmod/lmod/modulefiles/Core" LMOD_CMD="/usr/share/lmod/lmod/libexec/lmod" https_proxy="http://ptaishanpublic2:[email protected]:8080" HISTCONTROL="ignoredups" SHLVL="1" HOME="/root" no_proxy="127.0.0.1,.huawei.com,localhost,local,.local" BASH_ENV="/usr/share/lmod/lmod/init/bash" CONDA_PYTHON_EXE="/root/anaconda3/bin/python" LOGNAME="root" XDG_DATA_DIRS="/root/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share" SSH_CONNECTION="90.253.30.126 56175 90.91.33.64 22" MODULESHOME="/usr/share/lmod/lmod" CONDA_DEFAULT_ENV="colossal-AI" LMOD_SETTARG_FULL_SUPPORT="no" LESSOPEN="||/usr/bin/lesspipe.sh %s" XDG_RUNTIME_DIR="/run/user/0" DISPLAY="localhost:10.0" LMOD_DIR="/usr/share/lmod/lmod/libexec" _="/root/anaconda3/envs/colossal-AI/bin/colossalai" && torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py -c ./ckpt-fp32'

```
Exit code: 1
Stdout: already printed
Stderr: already printed

====== Training on All Nodes =====
127.0.0.1: failure

====== Stopping All Nodes =====
127.0.0.1: finish
```

Environment

```
python  3.9
pytorch 1.11
CUDA    11.3.r11.3
gcc     (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
```

Aaricis avatar Jul 25 '23 12:07 Aaricis

How about installing colossalai with `CUDA_EXT=1 pip install colossalai`?

flybird11111 avatar Aug 30 '23 08:08 flybird11111
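A note on the original trace: the `ModuleNotFoundError` is only the trigger; the JIT fallback then fails because `c++` rejects `-std=c++14`. PyTorch's extension builder compiles with whatever `c++` it finds (or `$CXX` if set), and on this node that is GCC 4.8.5, which predates the `-std=c++14` flag, even though the Environment section lists GCC 7.3.1. A quick way to check whether the compiler the build will actually use is new enough — the helper below is an illustrative sketch, not a torch or ColossalAI API:

```python
import re

def cxx_supports_cpp14(version_line: str) -> bool:
    """Given the first line of `c++ --version`, decide whether -std=c++14 will be accepted.

    GCC accepts -std=c++14 starting with 4.9; PyTorch's ABI warning additionally
    asks for GCC >= 5.0.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_line)
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (4, 9)

# The compiler the failing build actually used vs. the one in the Environment section:
print(cxx_supports_cpp14("c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)"))  # False
print(cxx_supports_cpp14("c++ (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)"))   # True
```

If the newer toolchain lives outside the default `PATH` (the `LD_LIBRARY_PATH` in the log points at a GCC 9.3.0 tree under `/hpcshare`), exporting `CXX` to that `g++` before `colossalai run` lets `torch.utils.cpp_extension` pick it up instead of the system GCC 4.8.5.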

Hi, have you found a solution to this problem? I am encountering the same problem with colossalai-0.3.2, torch-2.2.0.dev+cu121, cuda-12.2.

tomyoung903 avatar Sep 21 '23 14:09 tomyoung903

> Hi, have you found a solution to this problem? I am encountering the same problem with colossalai-0.3.2, torch-2.2.0.dev+cu121, cuda-12.2.

Hi, ColossalAI does not support Torch 2.0 and above. Torch 1.13.1 is recommended.

flybird11111 avatar Sep 21 '23 15:09 flybird11111
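To turn the advice above into a fail-fast check instead of a crash deep inside `HybridAdam`, a training script can gate on the installed torch version before importing ColossalAI. This is a sketch under the assumption that 1.13 is the newest supported line; the helper name is mine, not a ColossalAI API:

```python
def torch_version_ok(version: str, max_supported: tuple = (1, 13)) -> bool:
    """True if `version` (e.g. '1.13.1' or '2.2.0.dev+cu121') is at most the
    recommended major.minor release."""
    numeric = version.split("+")[0]          # drop local build tags like +cu121
    parts = []
    for piece in numeric.split(".")[:2]:     # compare major.minor only
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts) <= max_supported

print(torch_version_ok("1.13.1"))           # True
print(torch_version_ok("2.2.0.dev+cu121"))  # False
```

In practice this would wrap `import torch; torch_version_ok(torch.__version__)` and raise a clear error with the pin recommendation before any kernel build is attempted.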

Thanks! I heard colossal-ai was tested on h800. What env (cuda, torch) was used?

tomyoung903 avatar Sep 21 '23 16:09 tomyoung903

Has anybody found a solution to this? This is part of my trace, and there is enough disk space. Please help. The environment is a Singularity container with nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 as the base image, python3.11, and torch2.4.0. ColossalAI was built from source without BUILD_EXT=1, so the kernels were being compiled at runtime.

```
0: [default2]:[rank2]: op_kernel = load(
0: [default1]:[rank1]: File "/usr/local/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2141, in _import_module_from_library
0: [default2]:[rank2]: ^^^^^
0: [default1]:[rank1]: module = importlib.util.module_from_spec(spec)
0: [default2]:[rank2]: File "/usr/local/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1312, in load
0: [default1]:[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0: [default2]:[rank2]: return _jit_compile(
0: [default1]:[rank1]: File "<frozen importlib._bootstrap>", line 573, in module_from_spec
0: [default2]:[rank2]: ^^^^^^^^^^^^^
0: [default1]:[rank1]: File "<frozen importlib._bootstrap_external>", line 1233, in create_module
0: [default2]:[rank2]: File "/usr/local/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
0: [default1]:[rank1]: File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
0: [default2]:[rank2]: _write_ninja_file_and_build_library(
0: [default1]:[rank1]: ImportError: /leonardo/home/userexternal/bbhaskar/.cache/colossalai/torch_extensions/torch2.4_cuda-12.4-0cfcd2c08955132327ef5375c6a934473c823649bc3b1ce6f6990996e74ab92e/cpu_adam_x86.so: cannot open shared object file: No such file or directory
0: [default2]:[rank2]: File "/usr/local/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1834, in _write_ninja_file_and_build_library
0: [default2]:[rank2]: _run_ninja_build(
0: [default2]:[rank2]: File "/usr/local/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2121, in _run_ninja_build
0: [default2]:[rank2]: raise RuntimeError(message) from e
0: [default2]:[rank2]: RuntimeError: Error building extension 'cpu_adam_x86':
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam_x86 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/usr/local/lib/python3.11/site-packages/colossalai/kernel/extensions/csrc -I/usr/local/cuda-12.4/include -isystem /usr/local/lib/python3.11/site-packages/torch/include -isystem /usr/local/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.11/site-packages/torch/include/TH -isystem /usr/local/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -std=c++14 -std=c++17 -lcudart -lcublas -g -Wno-reorder -fopenmp -march=native -c /usr/local/lib/python3.11/site-packages/colossalai/kernel/extensions/csrc/kernel/x86/cpu_adam.cpp -o cpu_adam.o
0: [default2]:[rank2]: FAILED: cpu_adam.o
0: [default2]:[rank2]: c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam_x86 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/usr/local/lib/python3.11/site-packages/colossalai/kernel/extensions/csrc -I/usr/local/cuda-12.4/include -isystem /usr/local/lib/python3.11/site-packages/torch/include -isystem /usr/local/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.11/site-packages/torch/include/TH -isystem /usr/local/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -std=c++14 -std=c++17 -lcudart -lcublas -g -Wno-reorder -fopenmp -march=native -c /usr/local/lib/python3.11/site-packages/colossalai/kernel/extensions/csrc/kernel/x86/cpu_adam.cpp -o cpu_adam.o
0: [default2]:[rank2]: /tmp/ccWeKyt1.s: Assembler messages:
0: [default2]:[rank2]: /tmp/ccWeKyt1.s: Fatal error: can't write 3887 bytes to section .debug_str of cpu_adam.o: 'No space left on device'
0: [default2]:[rank2]: /tmp/ccWeKyt1.s: Fatal error: cpu_adam.o: No space left on device
0: [default2]:[rank2]: ninja: build stopped: subcommand failed.
```

Bhavani01 avatar Oct 21 '24 23:10 Bhavani01
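Note where the trace above actually fails: the assembler cannot write `cpu_adam.o`, so "enough disk space" on the main filesystem is not sufficient. The JIT build writes compiler scratch files under `$TMPDIR` (default `/tmp`, often a small tmpfs inside a container) and caches the built extension under the user's home (`~/.cache/colossalai/torch_extensions` in this trace), and either can fill up. A hedged workaround is to redirect both to a filesystem with room before the first build; the paths below are examples, and whether ColossalAI's own cache path honors `TORCH_EXTENSIONS_DIR` is version-dependent, so verify after the first build:

```python
import os
import shutil
import tempfile

# Example paths; substitute a scratch filesystem with space on your cluster.
build_tmp = os.path.expanduser("~/tmp-build")
ext_cache = os.path.expanduser("~/torch_extensions")
os.makedirs(build_tmp, exist_ok=True)
os.makedirs(ext_cache, exist_ok=True)

# TMPDIR is read by gcc/as for scratch files; TORCH_EXTENSIONS_DIR by
# torch.utils.cpp_extension.load for the build/cache directory.
os.environ["TMPDIR"] = build_tmp
os.environ["TORCH_EXTENSIONS_DIR"] = ext_cache
tempfile.tempdir = None  # force Python's tempfile to re-read TMPDIR

# Verify free space where the build will actually write.
for path in (build_tmp, ext_cache):
    free_gib = shutil.disk_usage(path).free / 2**30
    print(f"{path}: {free_gib:.1f} GiB free")
```

This must run (or the equivalent `export`s in the launch script) before the first kernel compilation, since the environment is only read when the child compiler processes start.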