llama.cpp
For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided via CUDA_DOCKER_ARCH
I tried to build llama.cpp with cublas and got an error.
I was using CUDA 11.4, then installed 12.4 and updated the PATH so nvcc points at the new version.
Windows 11.
Logs from w64devkit:
E:/llama.cpp $ make LLAMA_CUBLAS=1
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info:
I UNAME_S: Windows_NT
I UNAME_P: unknown
I UNAME_M: x86_64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -D_WIN32_WINNT=0x602 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -march=native -mtune=native -Xassembler -muse-unaligned-vector-move -Wdouble-promotion
I CXXFLAGS: -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -Xassembler -muse-unaligned-vector-move -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -D_WIN32_WINNT=0x602 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include
I NVCCFLAGS: -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS: -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib
I CC: cc (GCC) 13.2.0
I CXX: g++ (GCC) 13.2.0
I NVCC: Build cuda_12.4.r12.4/compiler.33961263_0
grep: unknown option -- P
BusyBox v1.37.0.git (2023-12-05 18:36:56 UTC)
Usage: grep [-HhnlLoqvsrRiwFE] [-m N] [-A|B|C N] { PATTERN | -e PATTERN... | -f FILE... } [FILE]...
Search for PATTERN in FILEs (or stdin)
-H Add 'filename:' prefix
-h Do not add 'filename:' prefix
-n Add 'line_no:' prefix
-l Show only names of files that match
-L Show only names of files that don't match
-c Show only count of matching lines
-o Show only the matching part of line
-q Quiet. Return 0 if PATTERN is found, 1 otherwise
-v Select non-matching lines
-s Suppress open and read errors
-r Recurse
-R Recurse and dereference symlinks
-i Ignore case
-w Match whole words only
-x Match whole lines only
-F PATTERN is a literal (not regexp)
-E PATTERN is an extended regexp
-m N Match up to N times per file
-A N Print N lines of trailing context
-B N Print N lines of leading context
-C N Same as '-A N -B N'
-e PTRN Pattern to match
-f FILE Read pattern from file
Makefile:613: *** I ERROR: For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided via CUDA_DOCKER_ARCH. Stop.
also from w64devkit:
E:/llama.cpp $ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:28:36_Pacific_Standard_Time_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
This solved the issue for me:
sudo apt install nvidia-cuda-toolkit
rm -rf build; cmake -S . -B build -DLLAMA_CUBLAS=ON && cmake --build build --config Release
This solved the issue for me:
sudo apt install nvidia-cuda-toolkit
rm -rf build; cmake -S . -B build -DLLAMA_CUBLAS=ON && cmake --build build --config Release
That didn't work for me because I'm using Windows. I would try this using WSL, but I want to use the already installed CUDA for Windows.
I'm having similar issues. Please update if you figure something out.
@mqopi I never used Windows so it's just a guess, but how about installing nvidia-cuda-toolkit for Windows? Just to be clear, it worked for me for a long time without nvidia-cuda-toolkit; I had to install it after one of the recent commits because it stopped working.
I managed to get it working by installing Visual Studio rather than just the build tools alone, then installing CUDA afterwards with VS integration. It should work after that. Also, I used CMake.
I believe the issue stems from this part of the error: "grep: unknown option -- P". I assume you are, like me, using w64devkit as the README suggests. It seems w64devkit's version of grep does not support the -P flag. If you look at line 645 of the Makefile, try changing the following from
CUDA_VERSION := $(shell $(NVCC) --version | grep -oP 'release (\K[0-9]+\.[0-9])')
To:
CUDA_VERSION := $(shell $(NVCC) --version | perl -nle 'print $& if /release ([0-9]+\.[0-9])/')
This uses perl instead, which was inspired by a Stack Overflow question. On the latest version of w64devkit on my system, I get the following output when I try to parse the version of nvcc:
$ nvcc.exe --version | perl -nle 'print $& if /release ([0-9]+\.[0-9])/'
release 12.3
This seems like a valid output, and I am able to get further along in compilation (I do, however, get a new and hopefully unrelated error: "Cannot find compiler 'cl.exe' in PATH").
I had the same problem. A contributor in this thread helped:
make CUDA_DOCKER_ARCH=sm_86 (for video cards from the 3060 to the 3090 Ti)
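The right sm_ value for other cards can be derived from the GPU's compute capability (e.g. 8.6 for the 3060 through 3090 Ti gives sm_86). A small sketch, assuming a driver recent enough that nvidia-smi supports the compute_cap query; on older drivers, look the value up in NVIDIA's CUDA GPU list instead:

```shell
# Turn a compute capability like "8.6" into the -arch value the Makefile wants.
# "8.6" is hard-coded here for illustration; normally you would query it with:
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader)
cap="8.6"
arch="sm_$(printf '%s' "$cap" | tr -d '.')"
echo "$arch"   # -> sm_86
```

Then pass the result on the make command line, e.g. make LLAMA_CUDA=1 CUDA_DOCKER_ARCH=sm_86.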
@ntriche >>
CUDA_VERSION := $(shell $(NVCC) --version | perl -nle 'print $& if /release ([0-9]+\.[0-9])/')
Also using Windows, also using w64devkit as the README recommended.
I did not have perl installed, so this line using the grep that ships with w64devkit worked: grep -oe 'release [0-9]*\.[0-9]*'
@Maintainers, please switch the grep params to use -oe instead of -P, if that works. I don't have access to a *nix system right now so I can't test it there, but I suspect it'll work the same.
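For anyone who wants to sanity-check that pattern without rerunning the build, here is a small sketch that feeds it a captured nvcc --version line (the sample string is taken from the log earlier in this thread):

```shell
# Verify the BusyBox-friendly extraction: -o plus a basic regex, no -P needed.
nvcc_line='Cuda compilation tools, release 12.4, V12.4.99'
ver=$(printf '%s\n' "$nvcc_line" | grep -oe 'release [0-9]*\.[0-9]*' | sed 's/release //')
echo "$ver"   # -> 12.4
```

BusyBox grep supports -o and -e, so this should behave the same under w64devkit as under GNU grep.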
Posting the error message here for future searchers, as this issue was difficult for me to find in the first place.
Error output
~/Programs/llama.cpp-master $ make -j LLAMA_CUDA=1
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info:
I UNAME_S: Windows_NT
I UNAME_P: unknown
I UNAME_M: x86_64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -D_WIN32_WINNT=0x602 -DGGML_USE_CUDA -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/include -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/targets/x86_64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -march=native -mtune=native -Xassembler -muse-unaligned-vector-move -Wdouble-promotion
I CXXFLAGS: -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -Xassembler -muse-unaligned-vector-move -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -D_WIN32_WINNT=0x602 -DGGML_USE_CUDA -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/include -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/targets/x86_64-linux/include
I NVCCFLAGS: -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS: -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -LC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/lib64 -L/usr/lib64 -LC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/targets/x86_64-linux/lib -L/usr/lib/wsl/lib
I CC: cc (GCC) 13.2.0
I CXX: g++ (GCC) 13.2.0
I NVCC: Build cuda_12.4.r12.4/compiler.33961263_0
grep: unknown option -- P
BusyBox v1.37.0.git (2023-12-05 17:59:02 UTC)
Usage: grep [-HhnlLoqvsrRiwFE] [-m N] [-A|B|C N] { PATTERN | -e PATTERN... | -f FILE... } [FILE]...
Search for PATTERN in FILEs (or stdin)
-H Add 'filename:' prefix
-h Do not add 'filename:' prefix
-n Add 'line_no:' prefix
-l Show only names of files that match
-L Show only names of files that don't match
-c Show only count of matching lines
-o Show only the matching part of line
-q Quiet. Return 0 if PATTERN is found, 1 otherwise
-v Select non-matching lines
-s Suppress open and read errors
-r Recurse
-R Recurse and dereference symlinks
-i Ignore case
-w Match whole words only
-x Match whole lines only
-F PATTERN is a literal (not regexp)
-E PATTERN is an extended regexp
-m N Match up to N times per file
-A N Print N lines of trailing context
-B N Print N lines of leading context
-C N Same as '-A N -B N'
-e PTRN Pattern to match
-f FILE Read pattern from file
Makefile:649: *** I ERROR: For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided via CUDA_DOCKER_ARCH. Stop.
I now get error:
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
make: *** [Makefile:482: ggml-cuda.o] Error 1
make: *** Waiting for unfinished jobs....
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
make: *** [Makefile:479: ggml-cuda/acc.o] Error 1
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
make: *** [Makefile:479: ggml-cuda/alibi.o] Error 1
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
make: *** [Makefile:479: ggml-cuda/arange.o] Error 1
but at least the first issue is gone :). #4119 recommends using CMake; I'll investigate that later.
Windows 11 / Nvidia 4060 Ti 16 GB - same issue -> getting the GPU to work.
NOTE: The following was done prior:
- Visual Studio Community (2022) must be installed with C++; otherwise, all kinds of pain.
- Then install/update the CUDA Toolkit.
- Wheel creation: used PowerShell instead of a cmd window.
- Manually set: $env:CMAKE_ARGS="-DLLAMA_CUBLAS=on" and $env:FORCE_CMAKE=1
- Ran: pip uninstall llama-cpp-python followed by pip install llama-cpp-python --force-reinstall --no-cache-dir --verbose
POSSIBLE ADJUSTMENT: in a cmd window, use set CMAKE_ARGS=-DLLAMA_CUBLAS=on and not set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
Did the following (this is the standard CPU install) in w64devkit (see the Windows install using "make"):
On Windows:
Download the latest fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
Extract w64devkit on your pc.
Run w64devkit.exe.
Use the cd command to reach the llama.cpp folder.
From here you can run:
make
Regardless of the steps above, I also ran in w64devkit: make LLAMA_CUDA=1
CUDA still would not work / the exe files would not "compile" with CUDA, so to speak.
Finally this worked:
STEP 1 - Getting GPU to work ( using w64devkit ):
mkdir build
cd build
cmake .. -DLLAMA_CUDA=ON
cmake --build . --config Release
Step 2: Moved the EXE files from build/bin/release to the main llama.cpp directory. These will overwrite the "old" CPU-only EXE files, and then the GPU should be used/available.
Confirmed this via imatrix calculations (7B model, 113 chunks) -> CPU 1 hr 15 min vs GPU 1.27 minutes.
Side note:
-> GPU with no offloading of layers (removed the "-ngl" flag) is now 10 minutes vs the "old CPU" build at 1 hr 15 minutes. (I kept a copy; GPU confirmed -> being used.)
-> GPU with -ngl 99: 1.27 minutes
Maybe back up the "old" exe files / or rename for other usage(s)?
Could a moderator / contributor please confirm this step? That it is OK and will not cause other issues?
ALTERNATIVE: Download "llama-b2694-bin-win-cuda-cu12.2.0-x64.zip" (or 11.7, etc.) from https://github.com/ggerganov/llama.cpp/releases/ and extract. These files can be used from a separate folder (i.e. I set up an "_exe" folder in the main llama.cpp folder).
The files could be pasted to overwrite the "old CPU" exe files, but this may have unintended issues. NOTE: A quick check showed the imatrix / GPU calc was SLOWER using these files: same imatrix run, 1.82 minutes vs 1.27 minutes. Could a moderator / contributor please confirm this step?
I had the same problem. A contributor in this thread helped:
make CUDA_DOCKER_ARCH=sm_86 (for video cards from the 3060 to the 3090 Ti)
This worked for me as well for RTX 3060:
make LLAMA_CUDA=1 CUDA_DOCKER_ARCH=sm_86
I went with changing the Makefile L653 to
CUDA_VERSION := $(shell $(NVCC) --version | perl -nle 'print $& if /release ([0-9]+\.[0-9])/')
Have VS with C++ installed, Nvidia CUDA tools installed, Win11, RTX 3080 Ti. Installed the "Strawberry" open-source version of Perl.
I got further, but now am getting this error:
C:/dev/llama.cpp $ make LLAMA_CUDA=1 CUDA_DOCKER_ARCH=sm_86
I ccache found, compilation results will be cached. Disable with LLAMA_NO_CCACHE.
I llama.cpp build info:
I UNAME_S: Windows_NT
I UNAME_P: unknown
I UNAME_M: x86_64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -D_WIN32_WINNT=0x602 -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/include -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/targets/x86_64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -march=native -mtune=native -Xassembler -muse-unaligned-vector-move -Wdouble-promotion
I CXXFLAGS: -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -Xassembler -muse-unaligned-vector-move -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -D_WIN32_WINNT=0x602 -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/include -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/targets/x86_64-linux/include
I NVCCFLAGS: -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_86 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS: -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -LC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/lib64 -L/usr/lib64 -LC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/targets/x86_64-linux/lib -L/usr/lib/wsl/lib
I CC: cc (GCC) 13.2.0
I CXX: x86_64-w64-mingw32-g++ (GCC) 13.2.0
I NVCC: Build cuda_12.4.r12.4/compiler.34097967_0
C:/Strawberry/c/bin/ccache.exe nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_86 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -D_WIN32_WINNT=0x602 -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/include -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.4/targets/x86_64-linux/include -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -Xassembler -muse-unaligned-vector-move -Wno-array-bounds -Wno-pedantic" -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
make: *** [Makefile:487: ggml-cuda.o] Error 1
Some fatal nvcc error about a single output file. ~~Any suggestions?~~
EDIT:
Worked with CMake instead of Make :)
This issue was closed because it has been inactive for 14 days since being marked as stale.
rm -rf build; cmake -S . -B build -DLLAMA_CUBLAS=ON && cmake --build build --config Release
I had to use
rm -rf build; cmake -S . -B build -DGGML_CUDA=ON && cmake --build build --config Release
but it worked wonders. Thank you!