
No module named 'colossalai._C.fused_optim'

Open Aadedd opened this issue 2 years ago • 7 comments

No module named 'colossalai._C.fused_optim'

Aadedd avatar Mar 07 '23 11:03 Aadedd

Potentially the same issue as #3041.

JThh avatar Mar 07 '23 12:03 JThh

but..

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

Aadedd avatar Mar 08 '23 03:03 Aadedd

Can you provide more error logs?

JThh avatar Mar 08 '23 09:03 JThh

[03/08/23 16:56:54] INFO colossalai - colossalai - INFO:
/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/context/parallel_context.py:521
set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
INFO colossalai - colossalai - INFO:
/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/context/parallel_context.py:557
set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 42, python random: 42,
ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO:
/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/initialize.py:120 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1

No pre-built kernel is found, build and load the cpu_adam kernel during runtime now

Emitting ninja build file /root/.cache/colossalai/torch_extensions/torch1.12_cu11.6/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.17799901962280273 seconds

No pre-built kernel is found, build and load the fused_optim kernel during runtime now

Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/colossalai/torch_extensions/torch1.12_cu11.6/build.ninja... Building extension module fused_optim... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [1/6] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu -o multi_tensor_scale_kernel.cuda.o FAILED: multi_tensor_scale_kernel.cuda.o /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" 
-I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu -o multi_tensor_scale_kernel.cuda.o /usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’: 435 | function(Functor&& f) | ^ /usr/include/c++/11/bits/std_function.h:435:145: note: ‘ArgTypes’ /usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’: 530 | operator=(Functor&& f) | ^ /usr/include/c++/11/bits/std_function.h:530:146: note: ‘ArgTypes’ [2/6] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem 
/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu -o multi_tensor_sgd_kernel.cuda.o FAILED: multi_tensor_sgd_kernel.cuda.o /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS 
-D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu -o multi_tensor_sgd_kernel.cuda.o /usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’: 435 | function(Functor&& f) | ^ /usr/include/c++/11/bits/std_function.h:435:145: note: ‘ArgTypes’ /usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’: 530 | operator=(Functor&& f) | ^ /usr/include/c++/11/bits/std_function.h:530:146: note: ‘ArgTypes’ [3/6] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 
--compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o FAILED: multi_tensor_adam.cuda.o /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o /usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’: 435 | function(Functor&& f) | ^ 
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘ArgTypes’ /usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’: 530 | operator=(Functor&& f) | ^ /usr/include/c++/11/bits/std_function.h:530:146: note: ‘ArgTypes’ [4/6] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu -o multi_tensor_l2norm_kernel.cuda.o FAILED: multi_tensor_l2norm_kernel.cuda.o /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" 
-I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu -o multi_tensor_l2norm_kernel.cuda.o /usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’: 435 | function(Functor&& f) | ^ /usr/include/c++/11/bits/std_function.h:435:145: note: ‘ArgTypes’ /usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’: 530 | operator=(Functor&& f) | ^ /usr/include/c++/11/bits/std_function.h:530:146: note: ‘ArgTypes’ [5/6] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem 
/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu -o multi_tensor_lamb.cuda.o FAILED: multi_tensor_lamb.cuda.o /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS 
-D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu -o multi_tensor_lamb.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 | function(_Functor&& __f)
      | ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 | operator=(_Functor&& __f)
      | ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/op_builder/builder.py", line 135, in load
    op_module = self.import_op()
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/op_builder/builder.py", line 118, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/root/anaconda3/envs/python37/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.fused_optim'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1814, in _run_ninja_build
    env=env)
  File "/root/anaconda3/envs/python37/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train_reward_model.py", line 100, in <module>
    train(args)
  File "train_reward_model.py", line 58, in train
    optim = HybridAdam(model.parameters(), lr=5e-5)
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 83, in __init__
    fused_optim = FusedOptimBuilder().load()
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/op_builder/builder.py", line 164, in load
    verbose=verbose)
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1214, in load
    keep_intermediates=keep_intermediates)
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1435, in _jit_compile
    is_standalone=is_standalone)
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1540, in _write_ninja_file_and_build_library
    error_prefix=f"Error building extension '{name}'")
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_optim'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51669) of binary: /root/anaconda3/envs/python37/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/python37/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_reward_model.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time : 2023-03-08_16:57:18
  host : jizhi
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 51669)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Aadedd avatar Mar 08 '23 09:03 Aadedd
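
The log above is long, but the pattern in it is short: each failing nvcc step reports the same two errors in /usr/include/c++/11/bits/std_function.h. A minimal sketch for condensing a ninja/nvcc log like this one, assuming the log text is already available as a string. The function name `summarize_build_log` is hypothetical and is not part of ColossalAI or PyTorch.

```python
import re


def summarize_build_log(log_text):
    """Return (failed_targets, error_locations) found in ninja/nvcc output.

    Hypothetical helper for reading long build logs; not a ColossalAI API.
    """
    # ninja prints "FAILED: <target>" once for every build step that fails.
    failed_targets = re.findall(r"FAILED:\s+(\S+)", log_text)

    # GCC-style diagnostics have the form "<file>:<line>:<col>: error: ...".
    raw = re.findall(r"(\S+):(\d+):(\d+): error:", log_text)
    locations = [(path, int(line)) for path, line, _col in raw]

    # Drop repeats while keeping first-seen order.
    unique_locations = list(dict.fromkeys(locations))
    return failed_targets, unique_locations
```

Run on the log in this thread, a helper like this would list the five `multi_tensor_*` object files as failed and only two distinct error locations, both in the GCC 11 `std_function.h` header.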

Can you upgrade your pytorch to 1.13 (ideally via conda) as per this?

JThh avatar Mar 08 '23 09:03 JThh

I was using version 1.13, but it didn't work, so I switched to 1.12. I will try again.

Aadedd avatar Mar 08 '23 10:03 Aadedd

@Aadedd any progress on that? I'm having the same issue.

tassiasP avatar Mar 10 '23 15:03 tassiasP

My problem has been solved: I upgraded CUDA to version 11.7.

Aadedd avatar Mar 11 '23 07:03 Aadedd

So @Aadedd, as a summary, CUDA 11.7 + torch 1.13 worked for you?

JThh avatar Mar 11 '23 08:03 JThh

yes

Aadedd avatar Mar 11 '23 08:03 Aadedd
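
For anyone landing here with the same build failure, a minimal sketch for checking which CUDA toolkit, host compiler, and PyTorch build the runtime kernel compilation will see. It assumes `nvcc` and `gcc` are on `PATH`; the helper names `parse_nvcc_release`, `tool_output`, and `report_toolchain` are hypothetical. The regex targets the "release 11.6" format shown in the `nvcc -V` output earlier in this thread.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from `nvcc -V` text, or None if not present."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def tool_output(cmd):
    """Run a command and return its stdout, or None if the tool is missing."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def report_toolchain():
    """Print the versions relevant to building CUDA extensions."""
    nvcc_text = tool_output(["nvcc", "-V"])
    print("nvcc release:", parse_nvcc_release(nvcc_text) if nvcc_text else "nvcc not found")

    gcc_text = tool_output(["gcc", "-dumpversion"])
    print("gcc version:", gcc_text.strip() if gcc_text else "gcc not found")

    try:
        import torch  # optional: only needed for the last line of the report
        print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)
    except ImportError:
        print("torch is not installed in this environment")


if __name__ == "__main__":
    report_toolchain()
```

In this thread, the failing setup reported nvcc release 11.6 and the build pulled headers from /usr/include/c++/11; the reporter's fix was upgrading CUDA to 11.7.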