
[BUG]: ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

Open · Zengpr opened this issue 1 year ago • 4 comments

๐Ÿ› Describe the bug

  • ่ฟ่กŒsh examples/train_sft.sh

image

  • The error output is as follows:

[04/19/23 15:25:30] INFO colossalai - colossalai - INFO: /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[04/19/23 15:25:31] INFO colossalai - colossalai - INFO: /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/context/parallel_context.py:558 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/initialize.py:115 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Loading checkpoint shards: 100%|██████████| 33/33 [00:15<00:00, 2.18it/s]
Traceback (most recent call last):
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 161, in load
    op_module = self.import_op()
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 110, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "examples/train_sft.py", line 189, in <module>
    train(args)
  File "examples/train_sft.py", line 94, in train
    optim = HybridAdam(model.parameters(), lr=args.lr, clipping_norm=1.0)
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
    cpu_optim = CPUAdamBuilder().load()
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 187, in load
    op_module = load(name=self.name,
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam': [1/2] /opt/conda/bin/x86_64-conda-linux-gnu-c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/kernel/cuda_native/csrc/includes -I/usr/local/cuda-11.7/include -isystem /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/include -isystem /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/include/TH -isystem /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -std=c++14 -lcudart -lcublas -g -Wno-reorder -fopenmp -march=native -c /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
/opt/conda/bin/x86_64-conda-linux-gnu-c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/kernel/cuda_native/csrc/includes -I/usr/local/cuda-11.7/include -isystem /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/include -isystem /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/include/TH -isystem /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -std=c++14 -lcudart -lcublas -g -Wno-reorder -fopenmp -march=native -c /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.cpp -o cpu_adam.o
In file included from /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/include/torch/csrc/Device.h:4,
                 from /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/python.h:8,
                 from /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/include/torch/extension.h:6,
                 from /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.h:29,
                 from /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.cpp:22:
/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/include/torch/csrc/python_headers.h:12:10: fatal error: Python.h: No such file or directory
   12 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 439899) of binary: /home/jovyan/work/projects/Example/ColossalAI/venv/bin/python
Traceback (most recent call last):
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

examples/train_sft.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2023-04-19_15:26:57
  host       : b583500e367e
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 439899)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
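The compile step in the log above stops at `fatal error: Python.h: No such file or directory`, which suggests the CPython development headers are not present in the include directory the compiler was given (`/usr/include/python3.8`). A minimal sketch for checking this from the same interpreter follows; the helper name `find_python_header` is hypothetical, and it only assumes that the JIT build uses the include directory reported by the standard `sysconfig` module.

```python
import os
import sysconfig


def find_python_header():
    """Return (include_dir, header_found) for the running interpreter.

    Hypothetical helper: it checks only the include directory that
    sysconfig reports, which is an assumption about what the build uses.
    """
    include_dir = sysconfig.get_paths()["include"]
    header = os.path.join(include_dir, "Python.h")
    return include_dir, os.path.isfile(header)


if __name__ == "__main__":
    include_dir, found = find_python_header()
    print(f"include dir: {include_dir}")
    if found:
        print("Python.h found")
    else:
        print("Python.h missing: the Python development headers are probably not installed")
```

If the header is reported missing, installing the development headers for that Python version is the likely fix; on Debian or Ubuntu these usually come from a package such as `python3-dev` (an assumption about the reporter's system, which the thread does not state).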

  • image

I have already gone through these and none of them solved my problem. I also built successfully with CUDA_EXT=1 pip install . but I still get the error above. What could be the cause?

Environment

image

gcc version 9.3.0

Zengpr, Apr 19 '23 07:04

Did you run pip install colossalai before running pip install . in the folder? Having both installed may have caused a conflict. You might want to run pip uninstall colossalai first.
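One way to see which installation Python actually picks up is to ask the import system where `colossalai` resolves to and whether the prebuilt `colossalai._C.cpu_adam` module exists there. This is only a sketch using the standard `importlib.util.find_spec`; the helper name `describe_module` is hypothetical.

```python
import importlib.util


def describe_module(name):
    """Return the file location of a module, or None if it cannot be found.

    Hypothetical helper. Note that find_spec imports parent packages, so
    looking up "colossalai._C.cpu_adam" will import colossalai itself.
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # A parent package is missing or could not be imported.
        return None
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    for name in ("colossalai", "colossalai._C.cpu_adam"):
        print(name, "->", describe_module(name))
```

Judging from the traceback in the report, when `colossalai._C.cpu_adam` cannot be imported the builder falls back to JIT compilation through `torch.utils.cpp_extension`, so a `None` for the second name would be consistent with the error. If `colossalai` resolves to a path other than the build you just installed, an older installation is probably shadowing it.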

JThh, Apr 20 '23 10:04

I have the same error while running the language/OPT example using run_gemini.sh with colossalai version 0.3.0. Is there any fix for this?

sreenithi, Jun 07 '23 08:06

Same issue when running the gpt2 example in the repo.

jacklanda, Jul 20 '23 09:07

I've hit the same issue using "CUDA_EXT=1 pip install".

zzb610, Feb 07 '24 03:02