GPTQ-for-LLaMa
Nvcc fatal : Unsupported gpu architecture 'compute_86'
I get the following error when trying to run setup.py from the GPTQ install. I have an RTX 3090 and followed the instructions from this GitHub gist:
```
FAILED: D:/AI/text-generation-webui/repositories/GPTQ-for-LLaMa/build/temp.win-amd64-cpython-310/Release/quant_cuda_kernel.obj
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc --generate-dependencies-with-compile --dependency-output D:\AI\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj.d --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\TH -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -IC:\Users\cruge\miniconda3\envs\textgen\include -IC:\Users\cruge\miniconda3\envs\textgen\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\cppwinrt" -c D:\AI\text-generation-webui\repositories\GPTQ-for-LLaMa\quant_cuda_kernel.cu -o D:\AI\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
nvcc fatal : Unsupported gpu architecture 'compute_86'
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "C:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\utils\cpp_extension.py", line 1808, in _run_ninja_build
    subprocess.run(
  File "C:\Users\cruge\miniconda3\envs\textgen\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
```
Check the CUDA version with `nvcc -V`.
Ran that and got `unknown option --V`, but running `nvcc --version` gave me:
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:59:34_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
```
The failing command above calls nvcc from `CUDA\v11.0\bin`, so the build is currently using CUDA 11.0 even though the nvcc on your PATH reports 11.7. Install CUDA 11.6.
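For reference, `sm_86` (the RTX 3090's Ampere architecture) is only supported starting with CUDA 11.1, which is why the 11.0 toolchain rejects `compute_86`. A minimal sketch for checking which toolkit the build will pick up versus the one PyTorch was built against:

```python
# Compare the nvcc on PATH with the CUDA version PyTorch was built against,
# and print the compute capability the extension build will target.
import shutil
import subprocess
import torch

print("nvcc on PATH:", shutil.which("nvcc"))
print("PyTorch built with CUDA:", torch.version.cuda)
print("GPU compute capability:", torch.cuda.get_device_capability(0))  # (8, 6) on an RTX 3090
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```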
I now get the following error after installing CUDA 11.6:
```
FAILED: D:/ai/text-generation-webui/repositories/GPTQ-for-LLaMa/build/temp.win-amd64-cpython-310/Release/quant_cuda_kernel.obj
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin\nvcc --generate-dependencies-with-compile --dependency-output D:\ai\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj.d --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\TH -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\include" -IC:\Users\cruge\miniconda3\envs\textgen\include -IC:\Users\cruge\miniconda3\envs\textgen\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\cppwinrt" -c D:\ai\text-generation-webui\repositories\GPTQ-for-LLaMa\quant_cuda_kernel.cu -o D:\ai\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
C:/Users/cruge/miniconda3/envs/textgen/lib/site-packages/torch/include\c10/macros/Macros.h(143): warning C4067: unexpected tokens following preprocessor directive - expected a newline
C:/Users/cruge/miniconda3/envs/textgen/lib/site-packages/torch/include\c10/macros/Macros.h(143): warning C4067: unexpected tokens following preprocessor directive - expected a newline
C:/Users/cruge/miniconda3/envs/textgen/lib/site-packages/torch/include\c10/core/SymInt.h(84): warning #68-D: integer conversion resulted in a change of sign
C:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\pybind11\cast.h(1429): error: too few arguments for template template parameter "Tuple"
          detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(1507): here
C:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\pybind11\cast.h(1503): error: too few arguments for template template parameter "Tuple"
          detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(1507): here
```
Okay, I was able to load the model by following the instructions here:
> Finally I managed to get it running. (I still can't compile it; thank you @Brawlence for providing the Windows wheel.) Here is the guide:
1. Install the latest version of text-generation-webui
2. Create the directory `text-generation-webui\repositories` and clone GPTQ-for-LLaMa there
3. Stay in the same conda env and install [this wheel](https://github.com/oobabooga/text-generation-webui/files/10947842/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl.zip) with the CUDA module: `pip install quant_cuda-0.0.0-cp310-cp310-win_amd64.whl` (a quick import check is sketched after this guide)
4. Copy the 4bit model to the `models` folder and make sure its name follows this format (example: `llama-30b-4bit.pt`). You still must have the directory with the 8bit model in HFv2 format.
5. Start the webui: `python .\server.py --model llama-30b --load-in-4bit --no-stream --listen`

Tested on Windows 11 with a 30B model and an RTX 4090.
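If the wheel installed correctly, the prebuilt extension should import without compiling anything locally. A minimal sanity check, assuming the module name `quant_cuda` from the wheel filename (it matches `-DTORCH_EXTENSION_NAME=quant_cuda` in the build logs above):

```python
# Verify the prebuilt quant_cuda wheel and the GPU are both usable,
# without building anything from source.
import torch
import quant_cuda  # provided by quant_cuda-0.0.0-cp310-cp310-win_amd64.whl

print("torch", torch.__version__, "built with CUDA", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
print("quant_cuda loaded from:", quant_cuda.__file__)
```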
If you have CUDA errors, do the following (a sketch of the patched code follows this list):

- Download this and this DLLs
- Copy them to `%USERPROFILE%\miniconda3\envs\textgen\lib\site-packages\bitsandbytes`
- Edit `%USERPROFILE%\miniconda3\envs\textgen\lib\site-packages\bitsandbytes\cuda_setup\main.py`
- Change `ct.cdll.LoadLibrary(binary_path)` to `ct.cdll.LoadLibrary(str(binary_path))` (two times)
- Replace `if not torch.cuda.is_available(): return 'libsbitsandbytes_cpu.so', None, None, None, None` with `if torch.cuda.is_available(): return 'libbitsandbytes_cuda116.dll', None, None, None, None`
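Put together, the two code edits amount to something like this. It is only a sketch: `evaluate_cuda_setup` is the function in that era's bitsandbytes `cuda_setup\main.py` (it may differ in other versions), and the `load_binary` helper is just an illustrative stand-in for the two call sites being patched:

```python
import ctypes as ct
import torch

def evaluate_cuda_setup():
    # Patched branch: when a GPU is present, return the bundled Windows
    # CUDA 11.6 DLL instead of falling back to the CPU-only library.
    if torch.cuda.is_available():
        return 'libbitsandbytes_cuda116.dll', None, None, None, None
    return 'libbitsandbytes_cpu.so', None, None, None, None

def load_binary(binary_path):
    # Patched call (made in two places in the real file): LoadLibrary
    # needs a str, not a pathlib.Path.
    return ct.cdll.LoadLibrary(str(binary_path))
```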
Originally posted by @Zerogoki00 in https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/11#issuecomment-1464961225
Can't compile, but I can run, so I'm fine with the outcome. Will keep trying to compile if people have solutions, though.
Since we have changed to use Triton now, we no longer have this issue.
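For anyone landing here later: with the Triton kernels there is no custom `quant_cuda` extension to compile, so a working Triton install is all the quantized path needs. A minimal check, assuming a platform where Triton is available (at the time, Linux only):

```python
# Confirm Triton is importable; the Triton backend removes the need to
# build the quant_cuda CUDA extension.
import triton
print(triton.__version__)
```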