BELLE
Error at runtime: nvcc fatal : Unsupported gpu architecture 'compute_86'
Detected CUDA files, patching ldflags
Emitting ninja build file /disk1/shisj/cache/torch_extensions/py39_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/TH -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /disk1/shisj/anaconda3/envs/belle/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
FAILED: fused_adam_frontend.o
c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/TH -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /disk1/shisj/anaconda3/envs/belle/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
c++: error: unrecognized command line option ‘-std=c++17’
c++: error: unrecognized command line option ‘-std=c++14’
[2/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/TH -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /disk1/shisj/anaconda3/envs/belle/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++17 -c /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/TH -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /disk1/shisj/anaconda3/envs/belle/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++17 -c /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_86'
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/disk1/shisj/anaconda3/envs/belle/lib/python3.9/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
I have tried the various fixes suggested online, but none of them solved the problem.
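A note on reading the log: the build calls `/usr/local/cuda/bin/nvcc`, the system toolkit, while the cache path (`py39_cu117`) indicates a torch build for CUDA 11.7. nvcc only accepts `compute_86` from CUDA 11.1 onward, so one plausible reading is that the system toolkit is older than that. The `c++` errors similarly suggest a host compiler that does not know the `-std=c++14` and `-std=c++17` flags. The sketch below is only a diagnostic aid: the helper names are hypothetical, the table covers just a few architectures, and the sample version string is illustrative rather than taken from this machine.

```python
import re
import subprocess

# Minimum CUDA toolkit release whose nvcc accepts each virtual architecture.
# Deliberately partial; extend as needed.
MIN_TOOLKIT = {
    "compute_70": (9, 0),
    "compute_75": (10, 0),
    "compute_80": (11, 0),
    "compute_86": (11, 1),
}


def parse_nvcc_release(version_text):
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def nvcc_supports(arch, release):
    """Return True if a toolkit of the given release accepts `arch`."""
    return release >= MIN_TOOLKIT[arch]


def installed_nvcc_release(nvcc_path="/usr/local/cuda/bin/nvcc"):
    """Query the nvcc binary that the build log shows being invoked."""
    result = subprocess.run(
        [nvcc_path, "--version"], capture_output=True, text=True, check=True
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    # Illustrative input only, not output captured from the reporter's machine.
    sample = "Cuda compilation tools, release 10.2, V10.2.89"
    release = parse_nvcc_release(sample)
    print(release, nvcc_supports("compute_86", release))
```

On the affected machine, calling `installed_nvcc_release()` and passing the result to `nvcc_supports("compute_86", ...)` would show whether the system nvcc is the limiting factor.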
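Are you running in our Docker environment? This looks more like a mismatch between the CUDA version and the PyTorch version on your machine. Which GPU are you using?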
No, not in Docker. It is an environment created with anaconda: CUDA 11.7, torch 2.0, and the GPU is an A40 48G. The OS is CentOS.
We recommend running in the Docker environment.
Otherwise it is hard for us to track down the cause of an error like this, because the problem is not in the code.
We have already provided the image address; you can simply pull it.
Initializing TorchBackend in DeepSpeed with backend nccl
[cfcd57922398:648 :0:1741] Caught signal 7 (Bus error: nonexistent physical address)
[cfcd57922398:646 :0:1740] Caught signal 7 (Bus error: nonexistent physical address)
[cfcd57922398:644 :0:1735] Caught signal 7 (Bus error: nonexistent physical address)
[cfcd57922398:642 :0:1743] Caught signal 7 (Bus error: nonexistent physical address)
[cfcd57922398:641 :0:1742] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 1740) ====
==== backtrace (tid: 1741) ====
==== backtrace (tid: 1743) ====
0 0x0000000000014420 __funlockfile() ???:0
0 0x0000000000014420 __funlockfile() ???:0
==== backtrace (tid: 1742) ====
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x0000000003bc80ed ncclShmOpen() /pytorch/third_party/nccl/nccl/src/misc/shmutils.cc:52
0 0x0000000000014420 __funlockfile() ???:0
3 0x0000000003bb4ff1 ncclProxyService() /pytorch/third_party/nccl/nccl/src/proxy.cc:897
1 0x000000000018bb41 __nss_database_lookup() ???:0
0 0x0000000000014420 __funlockfile() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0
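I pulled it. The image does not include the code to run, does it? I mounted my local code into the container and ran the sh file inside Docker, but it still fails, with the error posted above.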
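The backtrace passes through `ncclShmOpen`, which sets up NCCL's shared-memory segments under `/dev/shm`. A bus error at that point is often a sign that the container's shared memory is too small (Docker defaults `/dev/shm` to 64 MB), although nothing in this thread confirms that this was the cause here. A minimal sketch for checking it from inside the container follows; the helper names and the 1 GiB threshold are assumptions for illustration, not values taken from BELLE or NCCL.

```python
import shutil

# Assumed threshold for illustration; adjust to the job's real needs.
ASSUMED_MIN_SHM_BYTES = 1 << 30  # 1 GiB


def shm_free_bytes(path="/dev/shm"):
    """Return the free space, in bytes, of the filesystem holding `path`."""
    return shutil.disk_usage(path).free


def shm_looks_too_small(free_bytes, required_bytes=ASSUMED_MIN_SHM_BYTES):
    """Return True if the available shared memory is below the threshold."""
    return free_bytes < required_bytes


if __name__ == "__main__":
    free = shm_free_bytes()
    print(f"/dev/shm free: {free / (1 << 20):.0f} MiB")
    if shm_looks_too_small(free):
        print("shared memory may be too small for NCCL")
```

If the check reports too little space, starting the container with a larger `--shm-size` (or with `--ipc=host`) is the usual thing to try.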
Has this problem been solved?
It is solved on my side. The cause was a version conflict between torch and CUDA.
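For anyone landing here with the same symptom, a quick way to spot such a conflict is to compare the CUDA version torch was built against (`torch.version.cuda`) with the release of the CUDA toolkit found on the machine. The sketch below is one way to do that comparison; the function names and the three result labels are hypothetical, and the version strings in the demo are examples, not values confirmed by the poster.

```python
def parse_version(text):
    """Turn a version string such as '11.7' into (11, 7)."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def compare_cuda_versions(torch_cuda, toolkit_cuda):
    """Classify how torch's CUDA version relates to the local toolkit."""
    torch_major, torch_minor = parse_version(torch_cuda)
    tool_major, tool_minor = parse_version(toolkit_cuda)
    if torch_major != tool_major:
        return "major-mismatch"
    if torch_minor != tool_minor:
        return "minor-mismatch"
    return "match"


def torch_cuda_version():
    """Return torch.version.cuda, or None if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    # Example inputs only; substitute the values from your own environment.
    print(compare_cuda_versions("11.7", "11.7"))
    print(compare_cuda_versions("11.7", "10.2"))
```

A major mismatch is the case most likely to break extension builds; the practical fix is to install a torch build that matches the local toolkit, or a toolkit that matches torch.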