
Compilation error on Nvidia Jetson Nano

Open rvandernoort opened this issue 1 year ago • 21 comments

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do.

Hi! I'm trying to compile llama.cpp on an Nvidia Jetson Nano 2GB with cuBLAS, because I want to use the CUDA cores, but I'm running into compilation issues.

Current Behavior

Please provide a detailed written description of what llama.cpp did, instead.

Both the make and the CMake build methods result in various errors.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

$ lscpu

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               1
Model name:          Cortex-A57
Stepping:            r1p1
CPU max MHz:         1479,0000
CPU min MHz:         102,0000
BogoMIPS:            38.40
L1d cache:           32K
L1i cache:           48K
L2 cache:            2048K
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32
  • Operating System, e.g. for Linux:

$ uname -a

Linux rover-NVIDIA-JETSON 4.9.337-tegra #1 SMP PREEMPT Thu Jun 8 21:19:14 PDT 2023 aarch64 aarch64 aarch64 GNU/Linux

  • SDK version, e.g. for Linux:
$ python3 --version
Python 3.6.9
$ make --version
GNU Make 4.1
$ g++ --version
g++ (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
$ cmake --version
cmake version 3.28.0-rc4
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:34:44_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0

Failure Information (for bugs)

Please help provide information about the failure / bug.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. make
  2. make LLAMA_CUBLAS=1
  3. make LLAMA_CUBLAS=1 with -arch=compute_53
  4. make LLAMA_CUBLAS=1 with -arch=compute_53 and removed -mcpu
  5. cmake .. + cmake --build . --config Release (failure)
  6. cmake .. -DLLAMA_CUBLAS=ON + cmake --build . --config Release

Failure Logs

$ make
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   aarch64
I UNAME_M:   aarch64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=native 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation 
I NVCCFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation "
I LDFLAGS:    
I CC:        cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX:       g++ (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0

cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=native     -c ggml-quants.c -o ggml-quants.o
ggml-quants.c: In function ‘ggml_vec_dot_q2_K_q8_K’:
ggml-quants.c:403:27: error: implicit declaration of function ‘vld1q_s16_x2’; did you mean ‘vld1q_s16’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_s16_x2 vld1q_s16_x2
                           ^
ggml-quants.c:3679:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                         ^~~~~~~~~~~~~~~~~
ggml-quants.c:403:27: error: invalid initializer
 #define ggml_vld1q_s16_x2 vld1q_s16_x2
                           ^
ggml-quants.c:3679:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                         ^~~~~~~~~~~~~~~~~
ggml-quants.c:3680:41: warning: missing braces around initializer [-Wmissing-braces]
         const ggml_int16x8x2_t mins16 = {vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(mins))), vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(mins)))};
                                         ^
                                          {                                                                                                      }
ggml-quants.c:404:27: error: implicit declaration of function ‘vld1q_u8_x2’; did you mean ‘vld1q_u32’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:3716:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
                                              ^~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:3716:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
                                              ^~~~~~~~~~~~~~~~
ggml-quants.c:406:27: error: implicit declaration of function ‘vld1q_s8_x2’; did you mean ‘vld1q_s32’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_s8_x2  vld1q_s8_x2
                           ^
ggml-quants.c:3718:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
             ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                                        ^~~~~~~~~~~~~~~~
ggml-quants.c:406:27: error: invalid initializer
 #define ggml_vld1q_s8_x2  vld1q_s8_x2
                           ^
ggml-quants.c:3718:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
             ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                                        ^~~~~~~~~~~~~~~~
ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
                 ^
ggml-quants.c:3723:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(2, 2);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
                 ^
ggml-quants.c:3725:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(4, 4);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
                 ^
ggml-quants.c:3727:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(6, 6);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-quants.c: In function ‘ggml_vec_dot_q3_K_q8_K’:
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:4353:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
         ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
                                    ^~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:4371:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q3bits = ggml_vld1q_u8_x2(q3); q3 += 32;
                                              ^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: implicit declaration of function ‘vld1q_s8_x4’; did you mean ‘vld1q_s64’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
ggml-quants.c:4372:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
                                                ^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
ggml-quants.c:4372:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
                                                ^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
ggml-quants.c:4373:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes_2 = ggml_vld1q_s8_x4(q8); q8 += 64;
                                                ^~~~~~~~~~~~~~~~
ggml-quants.c: In function ‘ggml_vec_dot_q4_K_q8_K’:
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:5273:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q4bits = ggml_vld1q_u8_x2(q4); q4 += 32;
                                              ^~~~~~~~~~~~~~~~
ggml-quants.c:5291:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
             q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                     ^
ggml-quants.c:5300:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
             q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                     ^
ggml-quants.c: In function ‘ggml_vec_dot_q5_K_q8_K’:
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:5918:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
         ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
                                    ^~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:5926:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q5bits = ggml_vld1q_u8_x2(q5); q5 += 32;
                                              ^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
ggml-quants.c:5927:46: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
                                              ^~~~~~~~~~~~~~~~
ggml-quants.c: In function ‘ggml_vec_dot_q6_K_q8_K’:
ggml-quants.c:403:27: error: invalid initializer
 #define ggml_vld1q_s16_x2 vld1q_s16_x2
                           ^
ggml-quants.c:6627:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                         ^~~~~~~~~~~~~~~~~
ggml-quants.c:6629:43: warning: missing braces around initializer [-Wmissing-braces]
         const ggml_int16x8x2_t q6scales = {vmovl_s8(vget_low_s8(scales)), vmovl_s8(vget_high_s8(scales))};
                                           ^
                                            {                                                            }
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:6641:40: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh); qh += 32;
                                        ^~~~~~~~~~~~~~~~
ggml-quants.c:405:27: error: implicit declaration of function ‘vld1q_u8_x4’; did you mean ‘vld1q_u64’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_u8_x4  vld1q_u8_x4
                           ^
ggml-quants.c:6642:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
             ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
                                        ^~~~~~~~~~~~~~~~
ggml-quants.c:405:27: error: invalid initializer
 #define ggml_vld1q_u8_x4  vld1q_u8_x4
                           ^
ggml-quants.c:6642:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
             ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
                                        ^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
ggml-quants.c:6643:40: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
                                        ^~~~~~~~~~~~~~~~
ggml-quants.c:6686:21: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’
             q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
                     ^
cc1: some warnings being treated as errors
Makefile:542: recipe for target 'ggml-quants.o' failed
make: *** [ggml-quants.o] Error 1
$ make LLAMA_CUBLAS=1
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   aarch64
I UNAME_M:   aarch64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=native 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation 
I NVCCFLAGS: --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation "
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
I CC:        cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX:       g++ (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0

nvcc --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation " -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal   : Value 'native' is not defined for option 'gpu-architecture'
Makefile:439: recipe for target 'ggml-cuda.o' failed
make: *** [ggml-cuda.o] Error 1
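
(For context: the nvcc bundled with CUDA 10.2 predates -arch=native, which was added in a later CUDA release, so it needs an explicit architecture. A quick way to find the right value on-device is the deviceQuery sample; the path below is the default JetPack install location and an assumption on my part.)

# CUDA 10.2's nvcc does not understand -arch=native; ask the GPU instead.
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery | grep "CUDA Capability"    # the Nano's Maxwell GPU reports 5.3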

Setting -arch to compute_53 in the Makefile:
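
(The edit, roughly, as a sketch; the sed pattern assumes the stock Makefile still passes -arch=native to nvcc:)

# Swap nvcc's unsupported -arch=native for the Nano's Maxwell target.
sed -i 's/-arch=native/-arch=compute_53/' Makefile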

make LLAMA_CUBLAS=1
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   aarch64
I UNAME_M:   aarch64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=native 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation 
I NVCCFLAGS: --forward-unknown-to-host-compiler -use_fast_math -arch=compute_53 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation "
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
I CC:        cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX:       g++ (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0

nvcc --forward-unknown-to-host-compiler -use_fast_math -arch=compute_53 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation " -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal   : 'cpu=native': expected a number
Makefile:439: recipe for target 'ggml-cuda.o' failed
make: *** [ggml-cuda.o] Error 1

I could not find what number nvcc expected here for the cpu option, and removing the flag also fails.
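
(The message seems to come from nvcc itself: it has its own -m option that expects 32 or 64, so the forwarded host flag -mcpu=native gets parsed as -m with the value cpu=native. Stripping the flag before building, as a sketch:)

# nvcc 10.2 misreads the forwarded -mcpu=native as its "-m <bits>" option.
# Remove the flag from the Makefile and retry the cuBLAS build.
sed -i 's/ -mcpu=native//' Makefile
make LLAMA_CUBLAS=1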

make LLAMA_CUBLAS=1
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   aarch64
I UNAME_M:   aarch64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation 
I NVCCFLAGS: --forward-unknown-to-host-compiler -use_fast_math -arch=compute_53 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation "
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
I CC:        cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX:       g++ (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread    -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation  -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation  -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation  -c common/sampling.cpp -o sampling.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation  -c common/grammar-parser.cpp -o grammar-parser.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation  -c common/build-info.cpp -o build-info.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation  -c common/console.cpp -o console.o
nvcc --forward-unknown-to-host-compiler -use_fast_math -arch=compute_53 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation " -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu(5933): error: identifier "CUBLAS_TF32_TENSOR_OP_MATH" is undefined

ggml-cuda.cu(6578): error: identifier "CUBLAS_COMPUTE_16F" is undefined

ggml-cuda.cu(7514): error: identifier "CUBLAS_COMPUTE_16F" is undefined

ggml-cuda.cu(7548): error: identifier "CUBLAS_COMPUTE_16F" is undefined

4 errors detected in the compilation of "/tmp/tmpxft_00001bc3_00000000-6_ggml-cuda.cpp1.ii".
Makefile:439: recipe for target 'ggml-cuda.o' failed
make: *** [ggml-cuda.o] Error 1
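
(Those identifiers belong to the cublasComputeType_t / math-mode API introduced with CUDA 11, so the CUDA 10.2 headers on the Nano never define them. A quick check on the device, assuming the default JetPack include path:)

# CUBLAS_COMPUTE_16F and CUBLAS_TF32_TENSOR_OP_MATH first appear in CUDA 11;
# on CUDA 10.2 this grep should come back empty.
grep -rn "CUBLAS_COMPUTE_16F\|CUBLAS_TF32_TENSOR_OP_MATH" /usr/local/cuda/include/ \
    || echo "not defined: this cuBLAS API needs CUDA >= 11"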
$ cmake ..
-- cuBLAS found
-- Using CUDA architectures: 53
GNU ld (GNU Binutils for Ubuntu) 2.30
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- ARM detected
-- Configuring done (0.4s)
-- Generating done (0.9s)
-- Build files have been written to: /home/rover/llama.cpp/build
cmake --build . --config Release
[  1%] Building C object CMakeFiles/ggml.dir/ggml-quants.c.o
/home/rover/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q2_K_q8_K’:
/home/rover/llama.cpp/ggml-quants.c:403:27: error: implicit declaration of function ‘vld1q_s16_x2’; did you mean ‘vld1q_s16’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_s16_x2 vld1q_s16_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:3679:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                         ^~~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:403:27: error: invalid initializer
 #define ggml_vld1q_s16_x2 vld1q_s16_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:3679:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                         ^~~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:3680:41: warning: missing braces around initializer [-Wmissing-braces]
         const ggml_int16x8x2_t mins16 = {vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(mins))), vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(mins)))};
                                         ^
                                          {                                                                                                      }
/home/rover/llama.cpp/ggml-quants.c:404:27: error: implicit declaration of function ‘vld1q_u8_x2’; did you mean ‘vld1q_u32’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:3716:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
                                              ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:3716:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
                                              ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:406:27: error: implicit declaration of function ‘vld1q_s8_x2’; did you mean ‘vld1q_s32’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_s8_x2  vld1q_s8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:3718:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
             ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                                        ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:406:27: error: invalid initializer
 #define ggml_vld1q_s8_x2  vld1q_s8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:3718:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
             ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                                        ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
                 ^
/home/rover/llama.cpp/ggml-quants.c:3723:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(2, 2);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
                 ^
/home/rover/llama.cpp/ggml-quants.c:3725:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(4, 4);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
                 ^
/home/rover/llama.cpp/ggml-quants.c:3727:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(6, 6);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q3_K_q8_K’:
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:4353:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
         ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
                                    ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:4371:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q3bits = ggml_vld1q_u8_x2(q3); q3 += 32;
                                              ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:407:27: error: implicit declaration of function ‘vld1q_s8_x4’; did you mean ‘vld1q_s64’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:4372:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
                                                ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:4372:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
                                                ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:4373:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes_2 = ggml_vld1q_s8_x4(q8); q8 += 64;
                                                ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q4_K_q8_K’:
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:5273:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q4bits = ggml_vld1q_u8_x2(q4); q4 += 32;
                                              ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:5291:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
             q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                     ^
/home/rover/llama.cpp/ggml-quants.c:5300:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
             q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                     ^
/home/rover/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q5_K_q8_K’:
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:5918:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
         ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
                                    ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:5926:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q5bits = ggml_vld1q_u8_x2(q5); q5 += 32;
                                              ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:5927:46: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
                                              ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q6_K_q8_K’:
/home/rover/llama.cpp/ggml-quants.c:403:27: error: invalid initializer
 #define ggml_vld1q_s16_x2 vld1q_s16_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:6627:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                         ^~~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:6629:43: warning: missing braces around initializer [-Wmissing-braces]
         const ggml_int16x8x2_t q6scales = {vmovl_s8(vget_low_s8(scales)), vmovl_s8(vget_high_s8(scales))};
                                           ^
                                            {                                                            }
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:6641:40: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh); qh += 32;
                                        ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:405:27: error: implicit declaration of function ‘vld1q_u8_x4’; did you mean ‘vld1q_u64’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_u8_x4  vld1q_u8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:6642:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
             ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
                                        ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:405:27: error: invalid initializer
 #define ggml_vld1q_u8_x4  vld1q_u8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:6642:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
             ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
                                        ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:6643:40: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
                                        ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:6686:21: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’
             q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
                     ^
cc1: some warnings being treated as errors
CMakeFiles/ggml.dir/build.make:117: recipe for target 'CMakeFiles/ggml.dir/ggml-quants.c.o' failed
make[2]: *** [CMakeFiles/ggml.dir/ggml-quants.c.o] Error 1
CMakeFiles/Makefile2:623: recipe for target 'CMakeFiles/ggml.dir/all' failed
make[1]: *** [CMakeFiles/ggml.dir/all] Error 2
Makefile:145: recipe for target 'all' failed
make: *** [all] Error 2
$ cmake .. -DLLAMA_CUBLAS=ON
-- cuBLAS found
-- Using CUDA architectures: 53
GNU ld (GNU Binutils for Ubuntu) 2.30
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- ARM detected
-- Configuring done (0.3s)
-- Generating done (0.3s)
-- Build files have been written to: /home/rover/llama.cpp/build

$ cmake --build . --config Release
[  1%] Building C object CMakeFiles/ggml.dir/ggml-quants.c.o
/home/rover/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q2_K_q8_K’:
/home/rover/llama.cpp/ggml-quants.c:403:27: error: implicit declaration of function ‘vld1q_s16_x2’; did you mean ‘vld1q_s16’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_s16_x2 vld1q_s16_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:3679:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                         ^~~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:403:27: error: invalid initializer
 #define ggml_vld1q_s16_x2 vld1q_s16_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:3679:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                         ^~~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:3680:41: warning: missing braces around initializer [-Wmissing-braces]
         const ggml_int16x8x2_t mins16 = {vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(mins))), vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(mins)))};
                                         ^
                                          {                                                                                                      }
/home/rover/llama.cpp/ggml-quants.c:404:27: error: implicit declaration of function ‘vld1q_u8_x2’; did you mean ‘vld1q_u32’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:3716:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
                                              ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:3716:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
                                              ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:406:27: error: implicit declaration of function ‘vld1q_s8_x2’; did you mean ‘vld1q_s32’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_s8_x2  vld1q_s8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:3718:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
             ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                                        ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:406:27: error: invalid initializer
 #define ggml_vld1q_s8_x2  vld1q_s8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:3718:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
             ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                                        ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
                 ^
/home/rover/llama.cpp/ggml-quants.c:3723:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(2, 2);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
                 ^
/home/rover/llama.cpp/ggml-quants.c:3725:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(4, 4);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
                 ^
/home/rover/llama.cpp/ggml-quants.c:3727:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(6, 6);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q3_K_q8_K’:
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:4353:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
         ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
                                    ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:4371:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q3bits = ggml_vld1q_u8_x2(q3); q3 += 32;
                                              ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:407:27: error: implicit declaration of function ‘vld1q_s8_x4’; did you mean ‘vld1q_s64’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:4372:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
                                                ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:4372:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
                                                ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:4373:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes_2 = ggml_vld1q_s8_x4(q8); q8 += 64;
                                                ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q4_K_q8_K’:
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:5273:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q4bits = ggml_vld1q_u8_x2(q4); q4 += 32;
                                              ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:5291:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
             q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                     ^
/home/rover/llama.cpp/ggml-quants.c:5300:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
             q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                     ^
/home/rover/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q5_K_q8_K’:
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:5918:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
         ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
                                    ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:5926:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q5bits = ggml_vld1q_u8_x2(q5); q5 += 32;
                                              ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:5927:46: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
                                              ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q6_K_q8_K’:
/home/rover/llama.cpp/ggml-quants.c:403:27: error: invalid initializer
 #define ggml_vld1q_s16_x2 vld1q_s16_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:6627:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                         ^~~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:6629:43: warning: missing braces around initializer [-Wmissing-braces]
         const ggml_int16x8x2_t q6scales = {vmovl_s8(vget_low_s8(scales)), vmovl_s8(vget_high_s8(scales))};
                                           ^
                                            {                                                            }
/home/rover/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
/home/rover/llama.cpp/ggml-quants.c:6641:40: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh); qh += 32;
                                        ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:405:27: error: implicit declaration of function ‘vld1q_u8_x4’; did you mean ‘vld1q_u64’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_u8_x4  vld1q_u8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:6642:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
             ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
                                        ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:405:27: error: invalid initializer
 #define ggml_vld1q_u8_x4  vld1q_u8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:6642:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
             ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
                                        ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
/home/rover/llama.cpp/ggml-quants.c:6643:40: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
                                        ^~~~~~~~~~~~~~~~
/home/rover/llama.cpp/ggml-quants.c:6686:21: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’
             q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
                     ^
cc1: some warnings being treated as errors
CMakeFiles/ggml.dir/build.make:117: recipe for target 'CMakeFiles/ggml.dir/ggml-quants.c.o' failed
make[2]: *** [CMakeFiles/ggml.dir/ggml-quants.c.o] Error 1
CMakeFiles/Makefile2:623: recipe for target 'CMakeFiles/ggml.dir/all' failed
make[1]: *** [CMakeFiles/ggml.dir/all] Error 2
Makefile:145: recipe for target 'all' failed
make: *** [all] Error 2

Similar error to #3880.

I'm not sure what more to do, or whether this is even supported, because the Nvidia Jetson Nano runs Ubuntu 18.04 with CUDA 10.2, which is old, but I cannot upgrade it. I would really appreciate it if someone could help me figure this out; if I need to provide more information, let me know!

rvandernoort avatar Nov 16 '23 11:11 rvandernoort

What kind of Jetson board are you using? I might have some Jetsons at work to try this on, but I want to check whether your board and mine have the same hardware.

staviq avatar Nov 16 '23 14:11 staviq

I'm using this board: https://developer.nvidia.com/embedded/jetson-nano-developer-kit

I would very much appreciate the help!

rvandernoort avatar Nov 16 '23 20:11 rvandernoort

I'm in the same boat right now with the latest CUDA 10.2 (from apt install nvidia-jetpack): the files are all present in /usr/local/cuda-10.2/, but I'm getting the same errors.

I've contemplated just doing the Ubuntu repository driver installs instead of the JetPack-specific variants, to see if that is even functional.

VestigeJ avatar Nov 17 '23 03:11 VestigeJ

I'm using this board: https://developer.nvidia.com/embedded/jetson-nano-developer-kit

I would very much appreciate the help!

I'm pretty sure I have the exact same ones at work, so I'm going to try to borrow one; it might take a couple of days.

staviq avatar Nov 17 '23 12:11 staviq

Potentially related?: https://github.com/ggerganov/llama.cpp/issues/4123

EDIT: I've run it with the patch file from that issue, but I get some new errors that I don't know how to solve. Did anyone else make any progress?

make LLAMA_CUBLAS=1
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   aarch64
I UNAME_M:   aarch64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=native 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation 
I NVCCFLAGS: --compiler-options="  " -use_fast_math -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
I CC:        cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX:       g++ (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=native    -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation  -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation  -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation  -c common/sampling.cpp -o sampling.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation  -c common/grammar-parser.cpp -o grammar-parser.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation  -c common/build-info.cpp -o build-info.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation  -c common/console.cpp -o console.o
nvcc --compiler-options="  " -use_fast_math -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu(5970): error: identifier "CUBLAS_TF32_TENSOR_OP_MATH" is undefined

ggml-cuda.cu(6617): error: identifier "CUBLAS_COMPUTE_16F" is undefined

ggml-cuda.cu(7552): error: identifier "CUBLAS_COMPUTE_16F" is undefined

ggml-cuda.cu(7586): error: identifier "CUBLAS_COMPUTE_16F" is undefined

4 errors detected in the compilation of "/tmp/tmpxft_00002e62_00000000-6_ggml-cuda.cpp1.ii".
Makefile:440: recipe for target 'ggml-cuda.o' failed
make: *** [ggml-cuda.o] Error 1

rvandernoort avatar Nov 21 '23 11:11 rvandernoort

@staviq any luck so far on your end?

rvandernoort avatar Dec 04 '23 10:12 rvandernoort

@rvandernoort Nothing conclusive; installing a newer CUDA (which would solve the problem) on a Jetson is hacky and unstable.

Supposedly Nvidia will support a more recent CUDA in the next release of their Jetson software bundle.

staviq avatar Dec 04 '23 11:12 staviq

You need to provide CUDA_DOCKER_ARCH to make it work. E.g., for a Jetson Orin:

make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_87 LLAMA_CUDA_F16=1 -j 10

or for a Jetson Xavier:

make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_72 LLAMA_CUDA_F16=1 -j 10

vvsotnikov avatar Dec 13 '23 15:12 vvsotnikov

You need to provide CUDA_DOCKER_ARCH to make it work. E.g., make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_87 LLAMA_CUDA_F16=1 -j 10 for Jetson Orin or make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_72 LLAMA_CUDA_F16=1 -j 10 for Jetson Xavier.

This does not work on my Jetson Xavier; I'm still getting error: identifier "CUBLAS_COMPUTE_16F" is undefined.

eternitybt avatar Dec 14 '23 19:12 eternitybt

@eternitybt are you using CUDA 10? If so, updating it should fix the problem (at least it's now working on my Xavier NX 16GB).

vvsotnikov avatar Dec 14 '23 19:12 vvsotnikov

@vvsotnikov I am indeed still on CUDA 10. It's great to hear from someone who has actually done it that upgrading to CUDA 11 is possible on the Xavier NX! Do you have a link to the instructions that worked for you? I find a lot of conflicting information on the internet and am afraid to try anything because I do not want to mess up my system.

eternitybt avatar Dec 15 '23 00:12 eternitybt

@eternitybt I haven't updated it per se; I just installed JetPack 5.1.2 from scratch, which includes CUDA 11. If you can afford to wipe your Jetson, that's probably the easiest solution.

vvsotnikov avatar Dec 15 '23 00:12 vvsotnikov

I can confirm that after updating to JetPack 5.1.2 and using the command make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_72 LLAMA_CUDA_F16=1 -j 10 provided by @vvsotnikov, I can successfully compile and run the newest version of llama.cpp on my Jetson Xavier NX. Big thanks!!

eternitybt avatar Dec 16 '23 11:12 eternitybt

Compiling with gcc-10 and g++-10 is successful; that is the highest version available on the Jetson Nano Developer Kit.

armx40 avatar Dec 17 '23 06:12 armx40

Check https://github.com/ggerganov/llama.cpp/issues/4123#issuecomment-1865614568; I solved the compilation problem there.

FantasyGmm avatar Dec 21 '23 06:12 FantasyGmm

Thanks for all the suggestions. To conclude, here is how to build llama.cpp on an Nvidia Jetson Nano with Ubuntu 18.04 and CUDA 10.2 (a command sketch follows the list):

  1. Compile and install GCC 8.5 from source and export it as CC and CXX
  2. Comment out (or remove) the line MK_CXXFLAGS += -mcpu=native in the Makefile
  3. Run make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_52
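
A sketch of those steps as shell commands; the GCC tarball URL, install prefix, sed pattern, and -j value are my assumptions rather than a verified transcript:

# 1) Build and install GCC 8.5 from source, then point the build at it.
wget https://ftp.gnu.org/gnu/gcc/gcc-8.5.0/gcc-8.5.0.tar.gz
tar xf gcc-8.5.0.tar.gz && cd gcc-8.5.0
./contrib/download_prerequisites
mkdir build && cd build
../configure --prefix=/usr/local/gcc-8.5 --enable-languages=c,c++ --disable-multilib
make -j4 && sudo make install           # expect this to take hours on a Nano
export CC=/usr/local/gcc-8.5/bin/gcc
export CXX=/usr/local/gcc-8.5/bin/g++

# 2) Comment out the -mcpu=native host flag in llama.cpp's Makefile.
cd ~/llama.cpp
sed -i 's/^MK_CXXFLAGS += -mcpu=native/#&/' Makefile

# 3) Build with cuBLAS, targeting the Nano's Maxwell GPU.
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_52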

rvandernoort avatar Jan 11 '24 14:01 rvandernoort

@rvandernoort

Why do we need to compile directly from the source code of GCC?

I have a similar issue in ggml, and I can install gcc-8 and gcc-9 from apt by

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt update
sudo apt install gcc-9 g++-9

and to build ggml

cmake -DCMAKE_C_COMPILER=/usr/bin/gcc-9 -DCMAKE_CXX_COMPILER=/usr/bin/g++-9 ..

I hope this can help others who encounter the same issue on the Jetson Nano.

twmht avatar Jan 12 '24 02:01 twmht

@rvandernoort

Why do we need to compile GCC directly from source? I have a similar issue in ggml, and I can install gcc-8 and gcc-9 from apt and build ggml with them.

On JetPack 4, CUDA 10 restricts which GCC versions can be used; on the TX2 I can only use gcc-8 and below, and the Ubuntu-packaged gcc lacks these Arm intrinsics, so you need to build it from source.
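
(For anyone checking their toolchain: a quick feature test for the missing _x2 load intrinsics, as a sketch; substitute whichever compiler you want to test for gcc:)

# Does this compiler provide the vld1q_u8_x2 NEON intrinsic ggml-quants.c needs?
cat > /tmp/neon_x2_test.c <<'EOF'
#include <arm_neon.h>
int main(void) {
    unsigned char buf[32] = {0};
    uint8x16x2_t v = vld1q_u8_x2(buf);   /* missing from older gcc on aarch64 */
    return (int)vgetq_lane_u8(v.val[0], 0);
}
EOF
gcc -o /tmp/neon_x2_test /tmp/neon_x2_test.c && echo "intrinsic available"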

FantasyGmm avatar Jan 13 '24 13:01 FantasyGmm

@rvandernoort What kind of performance do you get out of this?

In my case running it without GPU offloading is actually faster than with it. Do you observe a similar pattern for your Jetson Nano?

PaulDrm avatar Jan 31 '24 23:01 PaulDrm

@rvandernoort What kind of performance do you get out of this?

In my case running it without GPU offloading is actually faster than with it. Do you observe a similar pattern for your Jetson Nano?

I've done a small test of your hypothesis, but for me the loading time of the model is similar, and GPU inference on a 4-bit quantized model is 3-4 times faster than CPU inference using the server.

An issue I did encounter was very slow behaviour when getting close to the RAM limit, so maybe model size is the issue?
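
(On a Jetson the CPU and GPU share the same RAM, so memory pressure hits both at once; tegrastats, which ships with JetPack, is a handy way to watch usage while the server runs. A sketch:)

# Watch the Nano's shared CPU/GPU memory while llama.cpp is running.
# --interval is in milliseconds.
sudo tegrastats --interval 1000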

rvandernoort avatar Feb 02 '24 11:02 rvandernoort

An issue I did encounter was very slow behaviour when getting close to the RAM limit, so maybe model size is the issue?

Sorry, I just read your comment. I am running Phi-2, so it should not be a problem with RAM. Weird. :/

PaulDrm avatar Feb 20 '24 17:02 PaulDrm

Closing this, as my project with the Jetson has finished and I managed to compile.

rvandernoort avatar Mar 24 '24 11:03 rvandernoort

If anyone else needs to run llama.cpp on a Jetson Nano, I have compiled all the advice listed here into a GitHub gist.
Have fun!

FlorSanders avatar Apr 11 '24 15:04 FlorSanders