llama.cpp does not compile on CUDA 10 anymore
It has been broken ever since this got merged: https://github.com/ggerganov/llama.cpp/pull/3370
The Makefile needs to be modified, because CUDA 10's nvcc has neither the --forward-unknown-to-host-compiler option nor -arch=native.
cuda10.patch
I second this with CUDA 11 on Ubuntu 22.04. I did not succeed in installing CUDA 12 on my Ubuntu 22.04, so I am stuck with 11. I added the following ugly hack to my Makefile, which seems to work on my system:
# Ugly hack: this system's nvcc does not accept -arch=native, so hard-code
# the compute capability (8.6 here) when the flag is known to be broken.
ifdef WEICON_BROKEN
NVCCFLAGS += -arch=compute_86
else
NVCCFLAGS += -arch=native
endif
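If you'd rather not hard-code the compute capability, a sketch of an alternative is to query the GPU at make time. This assumes a driver recent enough that nvidia-smi supports --query-gpu=compute_cap; on older drivers the query fails and the hard-coded fallback is used:

# Sketch: derive -arch from GPU 0's compute capability (e.g. "8.6" -> sm_86).
# Assumes nvidia-smi supports --query-gpu=compute_cap (recent drivers only).
CUDA_CC := $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader -i 0 2>/dev/null | tr -d '.')
ifneq ($(CUDA_CC),)
NVCCFLAGS += -arch=sm_$(CUDA_CC)
else
NVCCFLAGS += -arch=compute_86
endif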
So there is hope for me building this on Windows 8.1 with cuBLAS?
Hi, I tried to use your patch to compile on my Nvidia Jetson Nano, but I'm getting some new errors because of it. The Jetson runs CUDA 10.2; any idea what is wrong?
make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: aarch64
I UNAME_M: aarch64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=native
I CXXFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation
I NVCCFLAGS: --compiler-options=" " -use_fast_math -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC: cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX: g++ (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=native -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation -c common/sampling.cpp -o sampling.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation -c common/grammar-parser.cpp -o grammar-parser.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation -c common/build-info.cpp -o build-info.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native -Wno-array-bounds -Wno-format-truncation -c common/console.cpp -o console.o
nvcc --compiler-options=" " -use_fast_math -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu(5970): error: identifier "CUBLAS_TF32_TENSOR_OP_MATH" is undefined
ggml-cuda.cu(6617): error: identifier "CUBLAS_COMPUTE_16F" is undefined
ggml-cuda.cu(7552): error: identifier "CUBLAS_COMPUTE_16F" is undefined
ggml-cuda.cu(7586): error: identifier "CUBLAS_COMPUTE_16F" is undefined
4 errors detected in the compilation of "/tmp/tmpxft_00002e62_00000000-6_ggml-cuda.cpp1.ii".
Makefile:440: recipe for target 'ggml-cuda.o' failed
make: *** [ggml-cuda.o] Error 1
I tried to fix it by following this: https://github.com/ggerganov/whisper.cpp/issues/1018
but I still get the same error:
user@ubuntu:~/llama.cpp$ make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: aarch64
I UNAME_M: aarch64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/usr/local/cuda-10.2/targets/aarch64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=armv8.3-a
I CXXFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/usr/local/cuda-10.2/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=armv8.3-a -Wno-array-bounds -Wno-format-truncation
I NVCCFLAGS: --compiler-options=" " -use_fast_math -arch=compute_62 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/usr/local/cuda-10.2/targets/aarch64-linux/lib
I CC: cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX: g++ (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
nvcc --compiler-options=" " -use_fast_math -arch=compute_62 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu(6947): error: identifier "CUBLAS_COMPUTE_16F" is undefined
ggml-cuda.cu(7923): error: identifier "CUBLAS_COMPUTE_16F" is undefined
ggml-cuda.cu(7957): error: identifier "CUBLAS_COMPUTE_16F" is undefined
3 errors detected in the compilation of "/tmp/tmpxft_00007758_00000000-6_ggml-cuda.cpp1.ii".
Makefile:457: recipe for target 'ggml-cuda.o' failed
make: *** [ggml-cuda.o] Error 1
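These identifiers only exist since cuBLAS 11, which is why builds against CUDA 10 headers trip over them. The usual workaround is a small compatibility shim near the top of ggml-cuda.cu that maps the new names back to their CUDA 10 equivalents. The following is only a sketch of that mapping, so double-check the names against your toolkit's headers:

// Sketch: map cuBLAS 11 names to CUDA 10 equivalents for CUDA < 11.
// CUBLAS_TENSOR_OP_MATH and CUDA_R_16F/CUDA_R_32F exist in the CUDA 10 headers.
#if defined(CUDART_VERSION) && CUDART_VERSION < 11000
#define CUBLAS_TF32_TENSOR_OP_MATH CUBLAS_TENSOR_OP_MATH
#define CUBLAS_COMPUTE_16F CUDA_R_16F
#define CUBLAS_COMPUTE_32F CUDA_R_32F
#define cublasComputeType_t cudaDataType_t
#endif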
0001-fix-old-jetson-compile-error.patch
This patch should be useful. I checked the CUDA 10 documentation and modified some of the code. I have now built successfully on a TX2 (Ubuntu 18, JetPack 4, CUDA 10), and the performance is quite good. You need to compile and install a new GCC 8, which is the latest GCC supported by CUDA 10; that solves the C compiler part of the error.
It looks like a promising patch, thanks! Unfortunately I can only test this in the new year, but I'll report the results then.
Nice, that patch does fix the compile issue. However, something else is up:
current device: 0
GGML_ASSERT: ggml-cuda.cu:8498: !"cuBLAS error"
But it actually dies at various lines. Hmm, I'll check past revisions or something.
Okay, it's been broken since https://github.com/ggerganov/llama.cpp/commit/bcc0eb4591bec5ec02fad3f2bdcb1b265052ea56, which is the "per-layer KV cache + quantum K cache" update.
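For anyone repeating this kind of hunt, git bisect automates checking past revisions. A sketch, where test-cuda.sh is a hypothetical helper script that rebuilds with LLAMA_CUBLAS=1, runs a short prompt, and exits non-zero when the cuBLAS assert fires:

git bisect start
git bisect bad HEAD
git bisect good <last-known-good-commit>
git bisect run ./test-cuda.sh
git bisect reset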
I tried the patch on my Nano (Ubuntu 18) with CUDA 10.2, but it doesn't work for me. I believe my setup is okay; I also updated gcc and g++ to version 8.
Any idea what is going wrong?
I tested it on a Jetson TX2 and compiled GCC 8.5 myself. Do not use gcc-8 from the apt source; it does not work. I have submitted the content of the patch to the repository, so you can compile directly using the latest code.
Thanks a lot for the help! I am not sure what repository you mean, though. Do you have one with the correctly compiled gcc 8.5? I gave it a quick try myself, but it has some parameters I am not sure how to set.
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/gcc-8.5.0
./contrib/download_prerequisites
sudo mkdir build
cd build
sudo ../configure --enable-checking=release --enable-languages=c,c++
sudo make -j6
sudo make install
gcc -v
This takes a long, long time and a lot of disk space.
You can delete the gcc source folder after make install.
UPDATE: Managed to compile now! Needed to point make at the new gcc installation by:
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
I've installed gcc 8.5 from source
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/aarch64-unknown-linux-gnu/8.5.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../configure -enable-checking=release -enable-languages=c,c++
Thread model: posix
gcc version 8.5.0 (GCC)
and after commenting out this line in the Makefile to get rid of an error:
#MK_CXXFLAGS += -mcpu=native
and using CUDA_DOCKER_ARCH=sm_52, I still get the following error, similar to the one with CMake I've described in #3880:
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_52
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: aarch64
I UNAME_M: aarch64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion
I CXXFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation
I NVCCFLAGS: -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_52 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS: -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib -L/usr/local/cuda/targets/aarch64-linux/lib -L/usr/lib/wsl/lib
I CC: cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX: g++ (GCC) 8.5.0
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -c common/sampling.cpp -o sampling.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -c common/grammar-parser.cpp -o grammar-parser.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -c common/build-info.cpp -o build-info.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -c common/console.cpp -o console.o
nvcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_52 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi" -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu(598): warning: function "warp_reduce_sum(half2)" was declared but never referenced
ggml-cuda.cu(619): warning: function "warp_reduce_max(half2)" was declared but never referenced
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion -c ggml-alloc.c -o ggml-alloc.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion -c ggml-backend.c -o ggml-backend.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion -c ggml-quants.c -o ggml-quants.o
ggml-quants.c: In function ‘ggml_vec_dot_q2_K_q8_K’:
ggml-quants.c:403:27: error: implicit declaration of function ‘vld1q_s16_x2’; did you mean ‘vld1q_s16’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
ggml-quants.c:3725:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
^~~~~~~~~~~~~~~~~
ggml-quants.c:403:27: error: invalid initializer
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
ggml-quants.c:3725:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
^~~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: implicit declaration of function ‘vld1q_u8_x2’; did you mean ‘vld1q_u32’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:3749:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:3749:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:406:27: error: implicit declaration of function ‘vld1q_s8_x2’; did you mean ‘vld1q_s32’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s8_x2 vld1q_s8_x2
^
ggml-quants.c:3751:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:406:27: error: invalid initializer
#define ggml_vld1q_s8_x2 vld1q_s8_x2
^
ggml-quants.c:3751:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:3743:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
^
ggml-quants.c:3757:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
SHIFT_MULTIPLY_ACCUM_WITH_SCALE(2, 2);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-quants.c:3743:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
^
ggml-quants.c:3758:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
SHIFT_MULTIPLY_ACCUM_WITH_SCALE(4, 4);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-quants.c:3743:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
^
ggml-quants.c:3759:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
SHIFT_MULTIPLY_ACCUM_WITH_SCALE(6, 6);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-quants.c: In function ‘ggml_vec_dot_q3_K_q8_K’:
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:4365:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
^~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:4383:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q3bits = ggml_vld1q_u8_x2(q3); q3 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: implicit declaration of function ‘vld1q_s8_x4’; did you mean ‘vld1q_s64’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
ggml-quants.c:4384:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
ggml-quants.c:4384:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
ggml-quants.c:4385:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes_2 = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c: In function ‘ggml_vec_dot_q4_K_q8_K’:
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:5244:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q4bits = ggml_vld1q_u8_x2(q4); q4 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:5246:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
ggml-quants.c:5253:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
ggml-quants.c: In function ‘ggml_vec_dot_q5_K_q8_K’:
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:5840:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
^~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:5848:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q5bits = ggml_vld1q_u8_x2(q5); q5 += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
ggml-quants.c:5849:46: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c: In function ‘ggml_vec_dot_q6_K_q8_K’:
ggml-quants.c:403:27: error: invalid initializer
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
ggml-quants.c:6506:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
^~~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
ggml-quants.c:6520:40: note: in expansion of macro ‘ggml_vld1q_u8_x2’
ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh); qh += 32;
^~~~~~~~~~~~~~~~
ggml-quants.c:405:27: error: implicit declaration of function ‘vld1q_u8_x4’; did you mean ‘vld1q_u64’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_u8_x4 vld1q_u8_x4
^
ggml-quants.c:6521:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c:405:27: error: invalid initializer
#define ggml_vld1q_u8_x4 vld1q_u8_x4
^
ggml-quants.c:6521:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
ggml-quants.c:6522:40: note: in expansion of macro ‘ggml_vld1q_s8_x4’
ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
ggml-quants.c:6547:21: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
^
ggml-quants.c: In function ‘ggml_vec_dot_iq2_xxs_q8_K’:
ggml-quants.c:7264:17: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’
q8b = ggml_vld1q_s8_x4(q8); q8 += 64;
^
cc1: some warnings being treated as errors
Makefile:552: recipe for target 'ggml-quants.o' failed
make: *** [ggml-quants.o] Error 1
Your cc is still gcc 7, not gcc 8; see the "I CC:" line in your build info above.
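A quick way to confirm this, and to point the build at the self-compiled GCC (reusing the export approach from the UPDATE above):

cc --version    # still reports 7.x here: make's default cc is the old compiler
g++ --version   # reports 8.5.0 after the source install
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
make clean
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_52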
Any updates?
I just got my TX2 working with the latest commit of the master branch (a33e6a0, 02/26 2024). The following is what I have done.
- A factory-reset TX2 on JetPack 4.6.4, the last version which still supports the Jetson TX2. JetPack 4.6.4 provides CUDA 10.2 and GCC 7.
- Enable all six cores with sudo nvpmodel -m 0, use jetson-fan-ctl to keep the fan running, and jetson-stats to monitor the usage.
- Compile and install GCC 8.5 following @FantasyGmm's guide. I have made a copy of it here:
wget https://bigsearcher.com/mirrors/gcc/releases/gcc-8.5.0/gcc-8.5.0.tar.gz
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/gcc-8.5.0
./contrib/download_prerequisites
sudo mkdir build
cd build
sudo ../configure --enable-checking=release --enable-languages=c,c++
sudo make -j6
sudo make install
- Set the correct gcc/g++:
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
- Changed the line in the Makefile from MK_NVCCFLAGS += -O3 to MK_NVCCFLAGS += -maxrregcount=80. The original -O3 causes nvcc to report "nvcc fatal : redefinition of argument 'optimize'." The -maxrregcount=80 is a workaround for the error "too many resources for launch" during inference; I'm not a CUDA expert, the number 80 is from this link.
- Build with:
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6
- When running llama.cpp, I still need -ngl 33 (using llama2-7b) to explicitly offload all layers to the Jetson TX2 GPU:
./main -m llama-2-7b.Q4_0.gguf -ngl 33 -c 256 -b 512 -n 128 --keep 48
llama_print_timings: load time = 15632.56 ms
llama_print_timings: sample time = 3.56 ms / 24 runs ( 0.15 ms per token, 6735.90 tokens per second)
llama_print_timings: prompt eval time = 13273.26 ms / 145 tokens ( 91.54 ms per token, 10.92 tokens per second)
llama_print_timings: eval time = 5457.13 ms / 23 runs ( 237.27 ms per token, 4.21 tokens per second)
llama_print_timings: total time = 32417.66 ms / 168 tokens
Hmm, it does compile with CUDA 10.2 (but not with CUDA 10.1, which I previously used). I didn't even bother compiling a proper gcc; I just disabled the version check in /cuda-toolkit/targets/x86_64-linux/include/crt/host_config.h
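For reference, the guard in crt/host_config.h looks roughly like this in CUDA 10.2, which caps the host compiler at GCC 8; "disabling the version check" means commenting out the #error. This is a sketch from memory, so verify against your own copy of the header:

#if defined(__GNUC__) && __GNUC__ > 8
/* commenting out the next line is the "disabled version check" */
#error -- unsupported GNU version! gcc versions later than 8 are not supported!
#endif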
Then first compiled ggml-cuda.cu by hand like so:
~/cuda-10.2/cuda-toolkit/bin/nvcc --compiler-options="" -use_fast_math -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 --use_fast_math --compiler-options="-I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wmissing-declarations -Wno-unused-function -Wno-multichar -Wno-format-truncation -Wno-array-bounds -pthread -Wno-pedantic -march=native -mtune=native " -c ggml-cuda.cu -o ggml-cuda.o
And continued with make LLAMA_CUBLAS=1 as usual.
Commit 2bf8d0f7c4cc1235755ad06961ca761e458c5e55 broke it on CUDA 10.2. @slaren @JohannesGaessler
@whoreson this is getting a bit tiresome. Are you going to ask people to harass me over this again? Let's be clear: I have no interest in supporting ancient versions of CUDA. If this is important to you, you are welcome to fix it yourself and open a PR.
I have no intention to support CUDA 10. As slaren said, if you want it supported you are free to put in the effort yourself and I will then happily review your PRs.
@slaren Very well, if reading feedback from testers is tiresome, I shall cease providing it.
No clue who came up with this "harassment" meme. I sent you an e-mail in December with a question about this (having no GitHub account then) and received no answer; after that, I marked the commit here and said it's yours, so that's where inquiries can go. If this is what you think harassment looks like, then you're a lucky individual.
This issue was closed because it has been inactive for 14 days since being marked as stale.