llama.cpp does not compile on CUDA 10 anymore
It has been broken ever since this got merged: https://github.com/ggerganov/llama.cpp/pull/3370
The Makefile needs to be modified, because CUDA 10's nvcc has neither the --forward-unknown-to-host-compiler option nor -arch=native.
cuda10.patch
I second this with CUDA 11 on Ubuntu 22.04. I did not succeed in installing CUDA 12 on my Ubuntu 22.04, so I am stuck with 11. I added the following ugly hack to my Makefile, which seems to work on my system:
# Ugly hack: this system's nvcc does not accept -arch=native, so hard-code
# the compute capability (8.6 here) when the flag is known to be broken.
ifdef WEICON_BROKEN
NVCCFLAGS += -arch=compute_86
else
NVCCFLAGS += -arch=native
endif
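If you'd rather not hard-code the compute capability, a sketch of an alternative is to query the GPU at make time. This assumes a driver recent enough that nvidia-smi supports --query-gpu=compute_cap; on older drivers the query fails and the hard-coded fallback is used:

# Sketch: derive -arch from GPU 0's compute capability (e.g. "8.6" -> sm_86).
# Assumes nvidia-smi supports --query-gpu=compute_cap (recent drivers only).
CUDA_CC := $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader -i 0 2>/dev/null | tr -d '.')
ifneq ($(CUDA_CC),)
NVCCFLAGS += -arch=sm_$(CUDA_CC)
else
NVCCFLAGS += -arch=compute_86
endif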
So there is hope for me building this on Windows 8.1 with cuBLAS?
Hi, I tried to use your patch to compile on my Nvidia Jetson Nano, but I'm getting some new errors because of it. The Jetson runs CUDA 10.2; any idea what is wrong?
make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: aarch64
I UNAME_M: aarch64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=native
I CXXFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation
I NVCCFLAGS: --compiler-options=" " -use_fast_math -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC: cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX: g++ (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=native -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation -c common/sampling.cpp -o sampling.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation -c common/grammar-parser.cpp -o grammar-parser.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation -c common/build-info.cpp -o build-info.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation -c common/console.cpp -o console.o
nvcc --compiler-options=" " -use_fast_math -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu(5970): error: identifier "CUBLAS_TF32_TENSOR_OP_MATH" is undefined
ggml-cuda.cu(6617): error: identifier "CUBLAS_COMPUTE_16F" is undefined
ggml-cuda.cu(7552): error: identifier "CUBLAS_COMPUTE_16F" is undefined
ggml-cuda.cu(7586): error: identifier "CUBLAS_COMPUTE_16F" is undefined
4 errors detected in the compilation of "/tmp/tmpxft_00002e62_00000000-6_ggml-cuda.cpp1.ii".
Makefile:440: recipe for target 'ggml-cuda.o' failed
make: *** [ggml-cuda.o] Error 1
I tried to fix it by following this: https://github.com/ggerganov/whisper.cpp/issues/1018
but I still get the same error:
user@ubuntu:~/llama.cpp$ make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: aarch64
I UNAME_M: aarch64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/usr/local/cuda-10.2/targets/aarch64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=armv8.3-a
I CXXFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/usr/local/cuda-10.2/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=armv8.3-a -Wno-array-bounds -Wno-format-truncation
I NVCCFLAGS: --compiler-options=" " -use_fast_math -arch=compute_62 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/usr/local/cuda-10.2/targets/aarch64-linux/lib
I CC: cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX: g++ (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
nvcc --compiler-options=" " -use_fast_math -arch=compute_62 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu(6947): error: identifier "CUBLAS_COMPUTE_16F" is undefined
ggml-cuda.cu(7923): error: identifier "CUBLAS_COMPUTE_16F" is undefined
ggml-cuda.cu(7957): error: identifier "CUBLAS_COMPUTE_16F" is undefined
3 errors detected in the compilation of "/tmp/tmpxft_00007758_00000000-6_ggml-cuda.cpp1.ii".
Makefile:457: recipe for target 'ggml-cuda.o' failed
make: *** [ggml-cuda.o] Error 1
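These identifiers only exist since cuBLAS 11, which is why builds against CUDA 10 headers trip over them. The usual workaround is a small compatibility shim near the top of ggml-cuda.cu that maps the new names back to their CUDA 10 equivalents. The following is only a sketch of that mapping, so double-check the names against your toolkit's headers:

// Sketch: map cuBLAS 11 names to CUDA 10 equivalents for CUDA < 11.
// CUBLAS_TENSOR_OP_MATH and CUDA_R_16F/CUDA_R_32F exist in the CUDA 10 headers.
#if defined(CUDART_VERSION) && CUDART_VERSION < 11000
#define CUBLAS_TF32_TENSOR_OP_MATH CUBLAS_TENSOR_OP_MATH
#define CUBLAS_COMPUTE_16F CUDA_R_16F
#define CUBLAS_COMPUTE_32F CUDA_R_32F
#define cublasComputeType_t cudaDataType_t
#endif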
0001-fix-old-jetson-compile-error.patch
This patch should be useful. I checked the CUDA 10 documentation and modified some of the code. I have now built successfully on a TX2 (Ubuntu 18, JetPack 4, CUDA 10), and the performance is quite good. You need to compile and install a new GCC 8, which is the latest GCC supported by CUDA 10; that solves the C compiler part of the error.
It looks like a promising patch, thanks! Unfortunately I can only test this in the new year, but I'll report the results then.
Nice, that patch does fix the compile issue. However, something else is up:
current device: 0
GGML_ASSERT: ggml-cuda.cu:8498: !"cuBLAS error"
But it actually dies at various lines. Hmm, I'll check past revisions or something.
Okay, it's been broken since https://github.com/ggerganov/llama.cpp/commit/bcc0eb4591bec5ec02fad3f2bdcb1b265052ea56, which is the "per-layer KV cache + quantum K cache" update.
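For anyone repeating this kind of hunt, git bisect automates checking past revisions. A sketch, where test-cuda.sh is a hypothetical helper script that rebuilds with LLAMA_CUBLAS=1, runs a short prompt, and exits non-zero when the cuBLAS assert fires:

git bisect start
git bisect bad HEAD
git bisect good <last-known-good-commit>
git bisect run ./test-cuda.sh
git bisect reset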
I tried the patch on my Nano (Ubuntu 18) with CUDA 10.2, but it doesn't work for me. I believe my setup is okay; I also updated gcc and g++ to version 8.
Any idea what is going wrong?
I tested it on a Jetson TX2 and compiled GCC 8.5 myself. Do not use gcc-8 from the apt source; it does not work. I have submitted the content of the patch to the repository, so you can compile directly using the latest code.
Thanks a lot for the help! I am not sure what repository you mean, though. Do you have one with the correctly compiled gcc 8.5? I gave it a quick try myself, but it has some parameters I am not sure how to set.
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/gcc-8.5.0
./contrib/download_prerequisites
sudo mkdir build
cd build
sudo ../configure --enable-checking=release --enable-languages=c,c++
sudo make -j6
sudo make install
gcc -v
This takes a long, long time and a lot of disk space.
You can delete the gcc source folder after make install.
UPDATE: Managed to compile now! Needed to point make at the new gcc installation by:
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
I've installed gcc 8.5 from source
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/aarch64-unknown-linux-gnu/8.5.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../configure -enable-checking=release -enable-languages=c,c++
Thread model: posix
gcc version 8.5.0 (GCC)
and after commenting out this line in the Makefile to get rid of an error:
#MK_CXXFLAGS += -mcpu=native
and using CUDA_DOCKER_ARCH=sm_52, I still get the following error, similar to the one with CMake I've described in #3880:
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_52
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: aarch64
I UNAME_M: aarch64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion
I CXXFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation
I NVCCFLAGS: -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_52 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS: -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib -L/usr/local/cuda/targets/aarch64-linux/lib -L/usr/lib/wsl/lib
I CC: cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX: g++ (GCC) 8.5.0
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -c common/sampling.cpp -o sampling.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -c common/grammar-parser.cpp -o grammar-parser.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -c common/build-info.cpp -o build-info.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -c common/console.cpp -o console.o
nvcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_52 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi" -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu(598): warning: function "warp_reduce_sum(half2)" was declared but never referenced
ggml-cuda.cu(619): warning: function "warp_reduce_max(half2)" was declared but never referenced
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion -c ggml-alloc.c -o ggml-alloc.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion -c ggml-backend.c -o ggml-backend.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion -c ggml-quants.c -o ggml-quants.o
ggml-quants.c: In function ‘ggml_vec_dot_q2_K_q8_K’:
ggml-quants.c:403:27: error: implicit declaration of function ‘vld1q_s16_x2’; did you mean ‘vld1q_s16’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
ggml-quants.c:3725:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
^~~~~~~~~~~~~~~~~
ggml-quants.c:403:27: error: invalid initializer
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
ggml-quants.c:3725:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
^~~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: implicit declaration of function ‘vld1q_u8_x2’; did you mean ‘vld1q_u32’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:3749:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:3749:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:406:27: error: implicit declaration of function ‘vld1q_s8_x2’; did you mean ‘vld1q_s32’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s8_x2 vld1q_s8_x2
^
ggml-quants.c:3751:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:406:27: error: invalid initializer
#define ggml_vld1q_s8_x2 vld1q_s8_x2
^
ggml-quants.c:3751:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:3743:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
^
ggml-quants.c:3757:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
SHIFT_MULTIPLY_ACCUM_WITH_SCALE(2, 2);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-quants.c:3743:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
^
ggml-quants.c:3758:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
SHIFT_MULTIPLY_ACCUM_WITH_SCALE(4, 4);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-quants.c:3743:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
^
ggml-quants.c:3759:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
SHIFT_MULTIPLY_ACCUM_WITH_SCALE(6, 6);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-quants.c: In function ‘ggml_vec_dot_q3_K_q8_K’:
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:4365:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
^~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:4383:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q3bits = ggml_vld1q_u8_x2(q3); q3 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: implicit declaration of function ‘vld1q_s8_x4’; did you mean ‘vld1q_s64’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
ggml-quants.c:4384:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
ggml-quants.c:4384:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
ggml-quants.c:4385:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes_2 = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c: In function ‘ggml_vec_dot_q4_K_q8_K’:
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:5244:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q4bits = ggml_vld1q_u8_x2(q4); q4 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:5246:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
ggml-quants.c:5253:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
ggml-quants.c: In function ‘ggml_vec_dot_q5_K_q8_K’:
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:5840:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
^~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:5848:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q5bits = ggml_vld1q_u8_x2(q5); q5 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
ggml-quants.c:5849:46: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c: In function ‘ggml_vec_dot_q6_K_q8_K’:
ggml-quants.c:403:27: error: invalid initializer
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
ggml-quants.c:6506:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
^~~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:6520:40: note: in expansion of macro ‘ggml_vld1q_u8_x2’
ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh); qh += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:405:27: error: implicit declaration of function ‘vld1q_u8_x4’; did you mean ‘vld1q_u64’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_u8_x4 vld1q_u8_x4
^
ggml-quants.c:6521:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c:405:27: error: invalid initializer
#define ggml_vld1q_u8_x4 vld1q_u8_x4
^
ggml-quants.c:6521:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
ggml-quants.c:6522:40: note: in expansion of macro ‘ggml_vld1q_s8_x4’
ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c:6547:21: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
^
ggml-quants.c: In function ‘ggml_vec_dot_iq2_xxs_q8_K’:
ggml-quants.c:7264:17: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’
q8b = ggml_vld1q_s8_x4(q8); q8 += 64;
^
cc1: some warnings being treated as errors
Makefile:552: recipe for target 'ggml-quants.o' failed
make: *** [ggml-quants.o] Error 1
Your cc is still gcc 7, not gcc 8; see the "I CC:" line in your build info above.
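A quick way to confirm this, and to point the build at the self-compiled GCC (reusing the export approach from the UPDATE above):

cc --version    # still reports 7.x here: make's default cc is the old compiler
g++ --version   # reports 8.5.0 after the source install
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
make clean
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_52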
Any updates?
I just got my TX2 working with the latest commit of the master branch (a33e6a0, 02/26 2024). The following is what I have done.
- A factory-reset TX2 on JetPack 4.6.4, the last version which still supports the Jetson TX2. JetPack 4.6.4 provides CUDA 10.2 and GCC 7.
- Enable all six cores with sudo nvpmodel -m 0, use jetson-fan-ctl to keep the fan running, and jetson-stats to monitor the usage.
- Compile and install GCC 8.5 following @FantasyGmm's guide. I have made a copy of it here:
wget https://bigsearcher.com/mirrors/gcc/releases/gcc-8.5.0/gcc-8.5.0.tar.gz
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/gcc-8.5.0
./contrib/download_prerequisites
sudo mkdir build
cd build
sudo ../configure --enable-checking=release --enable-languages=c,c++
sudo make -j6
sudo make install
- Set the correct gcc/g++:
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
- Changed the line in the Makefile from MK_NVCCFLAGS += -O3 to MK_NVCCFLAGS += -maxrregcount=80. The original -O3 causes nvcc to report "nvcc fatal : redefinition of argument 'optimize'." The -maxrregcount=80 is a workaround for the error "too many resources for launch" during inference; I'm not a CUDA expert, the number 80 is from this link.
- Build with:
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6
- When running llama.cpp, I still need -ngl 33 (using llama2-7b) to explicitly offload all layers to the Jetson TX2 GPU:
./main -m llama-2-7b.Q4_0.gguf -ngl 33 -c 256 -b 512 -n 128 --keep 48
llama_print_timings: load time = 15632.56 ms
llama_print_timings: sample time = 3.56 ms / 24 runs ( 0.15 ms per token, 6735.90 tokens per second)
llama_print_timings: prompt eval time = 13273.26 ms / 145 tokens ( 91.54 ms per token, 10.92 tokens per second)
llama_print_timings: eval time = 5457.13 ms / 23 runs ( 237.27 ms per token, 4.21 tokens per second)
llama_print_timings: total time = 32417.66 ms / 168 tokens
Hmm, it does compile with CUDA 10.2 (but not with CUDA 10.1, which I previously used). I didn't even bother compiling a proper gcc; I just disabled the version check in /cuda-toolkit/targets/x86_64-linux/include/crt/host_config.h
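For reference, the guard in crt/host_config.h looks roughly like this in CUDA 10.2, which caps the host compiler at GCC 8; "disabling the version check" means commenting out the #error. This is a sketch from memory, so verify against your own copy of the header:

#if defined(__GNUC__) && __GNUC__ > 8
/* commenting out the next line is the "disabled version check" */
#error -- unsupported GNU version! gcc versions later than 8 are not supported!
#endif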
Then first compiled ggml-cuda.cu by hand like so:
~/cuda-10.2/cuda-toolkit/bin/nvcc --compiler-options="" -use_fast_math -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 --use_fast_math --compiler-options="-I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wmissing-declarations -Wno-unused-function -Wno-multichar -Wno-format-truncation -Wno-array-bounds -pthread -Wno-pedantic -march=native -mtune=native " -c ggml-cuda.cu -o ggml-cuda.o
And continued with make LLAMA_CUBLAS=1 as usual.
Commit 2bf8d0f7c4cc1235755ad06961ca761e458c5e55 broke it on CUDA 10.2. @slaren @JohannesGaessler
@whoreson this is getting a bit tiresome. Are you going to ask people to harass me over this again? Let's be clear: I have no interest in supporting ancient versions of CUDA. If this is important to you, you are welcome to fix it yourself and open a PR.
I have no intention to support CUDA 10. As slaren said, if you want it supported you are free to put in the effort yourself and I will then happily review your PRs.
@slaren Very well, if reading feedback from testers is tiresome, I shall cease providing it.
No clue who came up with this "harassment" meme. I sent you an e-mail in December with a question about this (having no GitHub account then) and received no answer; after that, I marked the commit here and said it's yours, so that's where inquiries can go. If this is what you think harassment looks like, then you're a lucky individual.
This issue was closed because it has been inactive for 14 days since being marked as stale.