
Error at runtime: nvcc fatal : Unsupported gpu architecture 'compute_86'

Open shishijier opened this issue 1 year ago • 2 comments

Detected CUDA files, patching ldflags
Emitting ninja build file /disk1/shisj/cache/torch_extensions/py39_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/TH -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /disk1/shisj/anaconda3/envs/belle/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
FAILED: fused_adam_frontend.o
c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/TH -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /disk1/shisj/anaconda3/envs/belle/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
c++: error: unrecognized command line option ‘-std=c++17’
c++: error: unrecognized command line option ‘-std=c++14’
[2/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/TH -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /disk1/shisj/anaconda3/envs/belle/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++17 -c /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/TH -isystem /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /disk1/shisj/anaconda3/envs/belle/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++17 -c /disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_86'
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/disk1/shisj/anaconda3/envs/belle/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/disk1/shisj/anaconda3/envs/belle/lib/python3.9/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

I have tried the various fixes suggested online, but none of them solved the problem.

shishijier avatar Apr 20 '23 07:04 shishijier

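Are you running this in our Docker environment? This looks more like a mismatch somewhere between the CUDA version and the PyTorch version on your machine. Which GPU are you using?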

xianghuisun avatar Apr 20 '23 12:04 xianghuisun
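
One way to test the version-mismatch hypothesis: the build log calls `/usr/local/cuda/bin/nvcc`, which may be an older system toolkit than the CUDA build that PyTorch was installed with. The sketch below parses the output of `nvcc --version` and checks whether that release should accept `compute_86`. It assumes that `compute_86`/`sm_86` support was introduced in CUDA 11.1. The helper names are hypothetical and are not part of BELLE, DeepSpeed, or PyTorch.

```python
import re
import shutil
import subprocess

# Assumption: nvcc accepts compute_86 / sm_86 starting with CUDA 11.1.
MIN_CUDA_FOR_SM86 = (11, 1)


def parse_nvcc_release(version_text):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def supports_sm86(release):
    """True if the given (major, minor) release is new enough for compute_86."""
    return release is not None and release >= MIN_CUDA_FOR_SM86


def installed_nvcc_release(nvcc_path="/usr/local/cuda/bin/nvcc"):
    """Run nvcc (the path from the log, else the one on PATH) and parse its release."""
    exe = shutil.which(nvcc_path) or shutil.which("nvcc")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    release = installed_nvcc_release()
    print("nvcc release:", release, "| accepts compute_86:", supports_sm86(release))
```

Comparing the reported release with `torch.version.cuda` from the same Python environment would show whether the compiler and the PyTorch build disagree.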

No, I am not running in Docker. I am using an environment created with anaconda: CUDA 11.7, torch 2.0, and an A40 48 GB GPU, on CentOS.

shishijier avatar Apr 20 '23 13:04 shishijier

You would be better off running in the Docker environment.

xianghuisun avatar Apr 21 '23 13:04 xianghuisun

Otherwise it is very hard for us to track down the cause of this kind of error, because it is not caused by the code.

We have already provided the image address, so you can simply pull it.

xianghuisun avatar Apr 21 '23 13:04 xianghuisun

Initializing TorchBackend in DeepSpeed with backend nccl
[cfcd57922398:648 :0:1741] Caught signal 7 (Bus error: nonexistent physical address)
[cfcd57922398:646 :0:1740] Caught signal 7 (Bus error: nonexistent physical address)
[cfcd57922398:644 :0:1735] Caught signal 7 (Bus error: nonexistent physical address)
[cfcd57922398:642 :0:1743] Caught signal 7 (Bus error: nonexistent physical address)
[cfcd57922398:641 :0:1742] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 1740) ====
==== backtrace (tid: 1741) ====
==== backtrace (tid: 1743) ====
0 0x0000000000014420 __funlockfile() ???:0
0 0x0000000000014420 __funlockfile() ???:0
==== backtrace (tid: 1742) ====
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x0000000003bc80ed ncclShmOpen() /pytorch/third_party/nccl/nccl/src/misc/shmutils.cc:52
0 0x0000000000014420 __funlockfile() ???:0
3 0x0000000003bb4ff1 ncclProxyService() /pytorch/third_party/nccl/nccl/src/proxy.cc:897
1 0x000000000018bb41 __nss_database_lookup() ???:0
0 0x0000000000014420 __funlockfile() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0

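I pulled it. The image does not include the code to run, does it? I mounted my local code into the container and ran the sh script inside Docker, but it still fails, with the error posted above.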

shishijier avatar Apr 21 '23 14:04 shishijier
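
A possible lead for the Bus error, not confirmed anywhere in this thread: the backtrace passes through `ncclShmOpen()`, and NCCL allocates shared-memory segments under `/dev/shm`, which Docker limits to 64 MB by default. The sketch below reports how large `/dev/shm` is inside the container. The 1 GiB threshold is an arbitrary illustrative value, not an NCCL requirement, and the function names are hypothetical.

```python
import os
import shutil

# Assumption: 1 GiB is only an illustrative threshold, not a documented NCCL minimum.
SUGGESTED_MIN_BYTES = 1 << 30


def shm_total_bytes(path="/dev/shm"):
    """Return the total size in bytes of the filesystem at `path`, or None if missing."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


def shm_looks_small(total_bytes, minimum=SUGGESTED_MIN_BYTES):
    """True if a known size is below the illustrative threshold."""
    return total_bytes is not None and total_bytes < minimum


if __name__ == "__main__":
    total = shm_total_bytes()
    print("/dev/shm total bytes:", total, "| looks small:", shm_looks_small(total))
```

If the reported size is small, the usual options are to restart the container with a larger `--shm-size` value or with `--ipc=host`.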

Has this problem been solved?

nullgogo avatar Apr 23 '23 07:04 nullgogo

I solved it on my side. The cause was a conflict between the torch and CUDA versions.

nullgogo avatar Apr 23 '23 08:04 nullgogo