ColossalAI
No module named 'colossalai._C.fused_optim'
Potentially the same issue as #3041.
But nvcc -V reports:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
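For anyone debugging this, it can help to first confirm that the toolkit nvcc reports matches the CUDA version PyTorch was built with, since a mismatch is a common cause of JIT extension build failures. A minimal diagnostic sketch (it only assumes torch is installed and nvcc is on PATH):

import subprocess

import torch

# Compare the CUDA version PyTorch was built with against the toolkit nvcc reports.
print("torch version        :", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("GPU available        :", torch.cuda.is_available())

# nvcc is assumed to be on PATH; adjust the invocation if it is not.
nvcc = subprocess.run(["nvcc", "-V"], capture_output=True, text=True, check=False)
print(nvcc.stdout.strip().splitlines()[-1])  # e.g. "Build cuda_11.6.r11.6/..."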
Can you provide more error logs?
[03/08/23 16:56:54] INFO colossalai - colossalai - INFO:
/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/context/parallel_context.py:521
set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
INFO colossalai - colossalai - INFO:
/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/context/parallel_context.py:557
set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 42, python random: 42,
ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO:
/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/initialize.py:120 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline
parallel size: 1, tensor parallel size: 1
No pre-built kernel is found, build and load the cpu_adam kernel during runtime now
Emitting ninja build file /root/.cache/colossalai/torch_extensions/torch1.12_cu11.6/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.17799901962280273 seconds
No pre-built kernel is found, build and load the fused_optim kernel during runtime now
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/colossalai/torch_extensions/torch1.12_cu11.6/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/6] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu -o multi_tensor_scale_kernel.cuda.o
FAILED: multi_tensor_scale_kernel.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu -o multi_tensor_scale_kernel.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
[2/6] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu -o multi_tensor_sgd_kernel.cuda.o
FAILED: multi_tensor_sgd_kernel.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu -o multi_tensor_sgd_kernel.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
[3/6] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
[4/6] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu -o multi_tensor_l2norm_kernel.cuda.o
FAILED: multi_tensor_l2norm_kernel.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu -o multi_tensor_l2norm_kernel.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
[5/6] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu -o multi_tensor_lamb.cuda.o
FAILED: multi_tensor_lamb.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -isystem /root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/python37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu -o multi_tensor_lamb.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/op_builder/builder.py", line 135, in load
op_module = self.import_op()
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/op_builder/builder.py", line 118, in import_op
return importlib.import_module(self.prebuilt_import_path)
File "/root/anaconda3/envs/python37/lib/python3.7/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1814, in _run_ninja_build
env=env)
File "/root/anaconda3/envs/python37/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "train_reward_model.py", line 100, in
train(args)
File "train_reward_model.py", line 58, in train
optim = HybridAdam(model.parameters(), lr=5e-5)
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 83, in init
fused_optim = FusedOptimBuilder().load()
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/colossalai/kernel/op_builder/builder.py", line 164, in load
verbose=verbose)
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1214, in load
keep_intermediates=keep_intermediates)
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1435, in _jit_compile
is_standalone=is_standalone)
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1540, in _write_ninja_file_and_build_library
error_prefix=f"Error building extension '{name}'")
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_optim'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51669) of binary: /root/anaconda3/envs/python37/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/python37/bin/torchrun", line 8, in
sys.exit(main())
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run
)(*cmd_args)
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/python37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_reward_model.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-03-08_16:57:18
host : jizhi
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 51669)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
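The "parameter packs not expanded with '...'" errors come from GCC 11's standard library headers (note the /usr/include/c++/11 paths), which the nvcc shipped with older CUDA 11.x toolkits struggles to parse, so the failure sits in the host toolchain rather than in the training script. To iterate on a toolchain fix without rerunning train_reward_model.py, the kernel build can be triggered on its own; a minimal sketch, assuming FusedOptimBuilder is importable from colossalai.kernel.op_builder as the traceback above suggests:

# Minimal sketch: trigger the fused_optim JIT build in isolation.
# Assumes FusedOptimBuilder is exported by colossalai.kernel.op_builder,
# the module shown in the traceback; adjust the import if your version differs.
from colossalai.kernel.op_builder import FusedOptimBuilder

fused_optim = FusedOptimBuilder().load()  # compiles with ninja on first call
print("fused_optim loaded:", fused_optim)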
Can you upgrade your PyTorch to 1.13 (ideally via conda), as per this?
I was using version 1.13, but it didn't work, so I switched to 1.12. I will try again.
@Aadedd any progress on that? I'm having the same issue.
My problem has been solved; I upgraded my CUDA version to 11.7.
So @Aadedd, to summarize, CUDA 11.7 + torch 1.13 worked for you?
yes
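For anyone landing here later, the upgrade can be verified end to end by constructing HybridAdam once, which is exactly the call that triggered the fused_optim build in the traceback above. A minimal sketch (it assumes a CUDA-capable GPU and that HybridAdam is importable from colossalai.nn.optimizer, matching the hybrid_adam.py path in the traceback):

import torch
from colossalai.nn.optimizer import HybridAdam

# Constructing HybridAdam JIT-builds/loads the cpu_adam and fused_optim kernels,
# so reaching the final print means the extensions compiled successfully.
model = torch.nn.Linear(8, 8).cuda()
optim = HybridAdam(model.parameters(), lr=5e-5)

loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
optim.step()
print("fused_optim built and HybridAdam step succeeded")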