
[BUG]: ModuleNotFoundError: No module named 'colossalai._C.fused_optim'

Open githubtianya opened this issue 1 year ago • 3 comments

🐛 Describe the bug

Traceback (most recent call last):
  File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 161, in load
    op_module = self.import_op()
  File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 109, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.fused_optim'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
    subprocess.run(
  File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "train_sft.py", line 221, in train(args) File "train_sft.py", line 89, in train Traceback (most recent call last): File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 161, in load optim = HybridAdam(model.parameters(), lr=args.lr, clipping_norm=1.0) File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 87, in init Traceback (most recent call last): File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 161, in load op_module = self.import_op() File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 109, in import_op return importlib.import_module(self.prebuilt_import_path) File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/importlib/init.py", line 127, in import_module op_module = self.import_op() File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 109, in import_op fused_optim = FusedOptimBuilder().load() File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 189, in load return _bootstrap._gcd_import(name[level:], package, level) File "", line 1014, in _gcd_import return importlib.import_module(self.prebuilt_import_path) File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/importlib/init.py", line 127, in import_module op_module = load( File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load return _bootstrap._gcd_import(name[level:], package, level) File "", line 1014, in _gcd_import return _jit_compile( File 
"/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile File "", line 991, in _find_and_load File "", line 991, in _find_and_load _write_ninja_file_and_build_library( File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library File "", line 973, in _find_and_load_unlocked _run_ninja_build( File "/mnt/afs/zhangaqiang/conda_envs/cloud-ai-lab/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build File "", line 973, in _find_and_load_unlocked ModuleNotFoundError: No module named 'colossalai._C.fused_optim'

Environment

torch 1.13
colossalai==0.3.3
coati==1.0.0

githubtianya avatar Feb 28 '24 08:02 githubtianya

Hi, did you solve this problem? I am having the same issue.

djFatNerd avatar Mar 18 '24 07:03 djFatNerd

Hi, can you try installing colossalai with "BUILD_EXT=1 pip install colossalai"?
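For anyone who wants to script that reinstall, here is a minimal sketch. It assumes the only thing needed is the BUILD_EXT=1 environment variable from the command above. The helper names (`build_ext_install_plan`, `run_plan`) are hypothetical, and `--no-cache-dir` is my own addition so pip does not reuse a previously cached wheel. It is not part of the suggested command.

```python
import os
import subprocess
import sys


def build_ext_install_plan(package="colossalai"):
    """Return (commands, env) for a clean reinstall with BUILD_EXT=1 set.

    Hypothetical helper: it only prepares the pip invocations, nothing runs here.
    """
    # Copy the current environment and add the variable from the suggestion above.
    env = dict(os.environ, BUILD_EXT="1")
    # Use the pip that belongs to the running interpreter (same conda env).
    pip = [sys.executable, "-m", "pip"]
    commands = [
        pip + ["uninstall", "-y", package],
        # --no-cache-dir is an assumption, added to avoid reusing a cached wheel.
        pip + ["install", "--no-cache-dir", package],
    ]
    return commands, env


def run_plan(commands, env):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    cmds, env = build_ext_install_plan()
    for cmd in cmds:
        print("BUILD_EXT=1", " ".join(cmd))
    # Uncomment to actually perform the reinstall:
    # run_plan(cmds, env)
```

Running the script as-is only prints the two pip commands. Nothing is uninstalled or installed until `run_plan` is uncommented.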

flybird11111 avatar Mar 18 '24 07:03 flybird11111

Thank you! I uninstalled and re-installed colossalai, but it still gives the same error.

Here is the complete error log:

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: config is deprecated and will be removed soon.
  warnings.warn("config is deprecated and will be removed soon.")
[03/18/24 15:27:39] INFO colossalai - colossalai - INFO: /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/initialize.py:67 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 2
[2024-03-18 15:27:39] Experiment directory created at ./outputs/019-DiT-XL-2
[2024-03-18 15:27:39] Added key: store_based_barrier_key:2 to store for rank: 0
[2024-03-18 15:27:39] Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
[2024-03-18 15:27:39] Added key: store_based_barrier_key:3 to store for rank: 0
[2024-03-18 15:27:39] Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 2 nodes.
[2024-03-18 15:27:39] Added key: store_based_barrier_key:4 to store for rank: 0
[2024-03-18 15:27:39] Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
[2024-03-18 15:27:47] Model params: 642.76 M
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 22.748290538787842 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
Traceback (most recent call last):
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/cpp_extension.py", line 128, in load
    op_kernel = self.import_op()
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/cpp_extension.py", line 58, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.fused_optim_cuda'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
    subprocess.run(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/jd_data/ColossalAI/OpenDiT/train.py", line 411, in <module>
    main(args)
  File "/data/jd_data/ColossalAI/OpenDiT/train.py", line 206, in main
    optimizer = HybridAdam(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 88, in __init__
    fused_optim = FusedOptimizerLoader().load()
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/kernel_loader.py", line 81, in load
    return usable_exts[0].load()
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/cpp_extension.py", line 132, in load
    op_kernel = self.build_jit()
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/cuda_extension.py", line 79, in build_jit
    op_kernel = load(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_optim_cuda':
[1/7] /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem
/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_sgd_kernel.cu -o multi_tensor_sgd_kernel.cuda.o FAILED: multi_tensor_sgd_kernel.cuda.o /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS 
-D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_sgd_kernel.cu -o multi_tensor_sgd_kernel.cuda.o In file included from /usr/include/cuda_runtime.h:83, from : /usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported! 138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported! | ^~~~~ [2/7] /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 
--use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o FAILED: multi_tensor_adam.cuda.o /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o In file included from /usr/include/cuda_runtime.h:83, from : 
/usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported! 138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported! | ^~~~~ [3/7] /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_scale_kernel.cu -o multi_tensor_scale_kernel.cuda.o FAILED: multi_tensor_scale_kernel.cuda.o /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" 
-I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_scale_kernel.cu -o multi_tensor_scale_kernel.cuda.o In file included from /usr/include/cuda_runtime.h:83, from : /usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported! 138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported! 
| ^~~~~ [4/7] /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_l2norm_kernel.cu -o multi_tensor_l2norm_kernel.cuda.o FAILED: multi_tensor_l2norm_kernel.cuda.o /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem 
/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_l2norm_kernel.cu -o multi_tensor_l2norm_kernel.cuda.o In file included from /usr/include/cuda_runtime.h:83, from : /usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported! 138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported! 
| ^~~~~ [5/7] /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_lamb.cu -o multi_tensor_lamb.cuda.o FAILED: multi_tensor_lamb.cuda.o /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem 
/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_lamb.cu -o multi_tensor_lamb.cuda.o In file included from /usr/include/cuda_runtime.h:83, from : /usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported! 138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported! 
| ^~~~~ [6/7] c++ -MMD -MF colossal_C_frontend.o.d -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/colossal_C_frontend.cpp -o colossal_C_frontend.o ninja: build stopped: subcommand failed.
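Every failed step in the log above stops at the same line from /usr/include/crt/host_config.h: "unsupported GNU version! gcc versions later than 8 are not supported!". A minimal sketch for checking the host compiler against that limit follows. The limit of 8 is taken from this log message only and will differ for other CUDA header versions. The function names are hypothetical, and the script assumes the compiler nvcc picks up is the `gcc` found on PATH, which may not hold on every setup.

```python
import shutil
import subprocess

# Taken from the host_config.h message in the log above:
# "gcc versions later than 8 are not supported!"
MAX_GCC_MAJOR = 8


def parse_gcc_major(dumpversion_output):
    """Extract the major version from `gcc -dumpversion` output, e.g. '9.4.0' -> 9."""
    return int(dumpversion_output.strip().split(".")[0])


def host_gcc_ok(major, max_major=MAX_GCC_MAJOR):
    """True if this gcc major version is within the limit named in the error."""
    return major <= max_major


def check_host_compiler(compiler="gcc"):
    """Report whether the compiler on PATH would pass the header's version check."""
    path = shutil.which(compiler)
    if path is None:
        return f"{compiler} not found on PATH"
    out = subprocess.run(
        [path, "-dumpversion"], capture_output=True, text=True, check=True
    ).stdout
    major = parse_gcc_major(out)
    verdict = "accepted" if host_gcc_ok(major) else "rejected"
    return f"{path}: gcc {major} would be {verdict} (limit {MAX_GCC_MAJOR})"


if __name__ == "__main__":
    print(check_host_compiler())
```

If the report says "rejected", the JIT build of fused_optim_cuda would be expected to hit the same #error until an accepted gcc (or CUDA headers that accept the installed gcc) is used.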

[extension] Time taken to compile cpu_adam_x86 op: 38.66856265068054 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
Traceback (most recent call last):
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/cpp_extension.py", line 128, in load
    op_kernel = self.import_op()
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/cpp_extension.py", line 58, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.fused_optim_cuda'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
    subprocess.run(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/jd_data/ColossalAI/OpenDiT/train.py", line 411, in <module>
    main(args)
  File "/data/jd_data/ColossalAI/OpenDiT/train.py", line 206, in main
    optimizer = HybridAdam(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 88, in __init__
    fused_optim = FusedOptimizerLoader().load()
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/kernel_loader.py", line 81, in load
    return usable_exts[0].load()
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/cpp_extension.py", line 132, in load
    op_kernel = self.build_jit()
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/cuda_extension.py", line 79, in build_jit
    op_kernel = load(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_optim_cuda':
[1/6] /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem
/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_lamb.cu -o multi_tensor_lamb.cuda.o FAILED: multi_tensor_lamb.cuda.o /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS 
-D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_lamb.cu -o multi_tensor_lamb.cuda.o In file included from /usr/include/cuda_runtime.h:83, from : /usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported! 138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported! | ^~~~~ [2/6] /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode 
arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_l2norm_kernel.cu -o multi_tensor_l2norm_kernel.cuda.o FAILED: multi_tensor_l2norm_kernel.cuda.o /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_l2norm_kernel.cu -o multi_tensor_l2norm_kernel.cuda.o In file included from /usr/include/cuda_runtime.h:83, from : 
/usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported! 138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported! | ^~~~~ [3/6] /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_sgd_kernel.cu -o multi_tensor_sgd_kernel.cuda.o FAILED: multi_tensor_sgd_kernel.cuda.o /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" 
-I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_sgd_kernel.cu -o multi_tensor_sgd_kernel.cuda.o In file included from /usr/include/cuda_runtime.h:83, from : /usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported! 138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported! 
| ^~~~~ [4/6] /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o FAILED: multi_tensor_adam.cuda.o /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem 
/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o In file included from /usr/include/cuda_runtime.h:83, from : /usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported! 138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported! 
| ^~~~~ [5/6] /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_scale_kernel.cu -o multi_tensor_scale_kernel.cuda.o FAILED: multi_tensor_scale_kernel.cuda.o /data/jd_data/miniconda3/envs/opendit/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include -isystem 
/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/TH -isystem /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/include/THC -isystem /data/jd_data/miniconda3/envs/opendit/include -isystem /data/jd_data/miniconda3/envs/opendit/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/multi_tensor_scale_kernel.cu -o multi_tensor_scale_kernel.cuda.o In file included from /usr/include/cuda_runtime.h:83, from : /usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported! 138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported! | ^~~~~ ninja: build stopped: subcommand failed.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1162234) of binary: /data/jd_data/miniconda3/envs/opendit/bin/python
Traceback (most recent call last):
  File "/data/jd_data/miniconda3/envs/opendit/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/jd_data/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
  time : 2024-03-18_15:28:37
  host : zju-ESC8000A-E11
  rank : 1 (local_rank: 1)
  exitcode : 1 (pid: 1162235)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time : 2024-03-18_15:28:37
  host : zju-ESC8000A-E11
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 1162234)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
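
Every failed step in the log above stops at the same line, `#error -- unsupported GNU version! gcc versions later than 8 are not supported!`, raised from `/usr/include/crt/host_config.h`. That suggests the host gcc is newer than the CUDA headers in use accept. The header also sits under `/usr/include` rather than the conda environment that provides `nvcc`, which may mean headers from a system-wide CUDA install are being picked up. A minimal sketch for checking the compiler side, assuming `gcc` on `PATH` is the host compiler nvcc uses; the limit of 8 is copied from the error text only, and `gcc_major`, `host_compiler_ok` and `MAX_GCC_MAJOR` are hypothetical names for illustration, not part of ColossalAI or PyTorch:

```python
import re
import shutil
import subprocess

# Assumption: taken from the error text in the log ("gcc versions later
# than 8 are not supported"). Other CUDA toolkits enforce other limits.
MAX_GCC_MAJOR = 8


def gcc_major(version_text: str) -> int:
    """Return the major version from a string such as '9.4.0' or '11'."""
    match = re.match(r"\s*(\d+)", version_text)
    if match is None:
        raise ValueError(f"unrecognised gcc version string: {version_text!r}")
    return int(match.group(1))


def host_compiler_ok(version_text: str, max_major: int = MAX_GCC_MAJOR) -> bool:
    """True if the gcc major version is within the limit the headers accept."""
    return gcc_major(version_text) <= max_major


def main() -> None:
    gcc = shutil.which("gcc")
    if gcc is None:
        print("gcc not found on PATH")
        return
    # `gcc -dumpversion` prints only the version, e.g. "9.4.0" or "11".
    result = subprocess.run([gcc, "-dumpversion"], capture_output=True, text=True)
    version = result.stdout.strip()
    try:
        ok = host_compiler_ok(version)
    except ValueError as exc:
        print(f"could not parse gcc version: {exc}")
        return
    print(f"gcc: {gcc} (version {version})")
    print(f"nvcc: {shutil.which('nvcc')}")
    verdict = "within" if ok else "above"
    print(f"gcc major version is {verdict} the limit of {MAX_GCC_MAJOR} from the error message")


if __name__ == "__main__":
    main()
```

If the script reports a gcc above the limit, the likely options are an older gcc for the JIT build or a CUDA toolkit whose headers accept the installed gcc; which of the two applies to this machine is not something the log alone settles.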

djFatNerd commented on Mar 18 '24 07:03