
Build failure on Jetson Orin

Open malv-c opened this issue 1 year ago • 2 comments

Both llama.cpp, built with:

% cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_DMMV_F16=ON -DLLAMA_CUDA_DMMV_Y=16

and koboldcpp, built with:

% cmake .. -DLLAMA_CUBLAS=1

fail with:

ggml.h(218): error: identifier "__fp16" is undefined
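For context, the declaration that trips up the CUDA compiler sits in the ARM half-precision typedef block of ggml.h. Reconstructed from that era's source (so the exact line may differ slightly), it looks roughly like this; nvcc does not accept the GCC/Clang __fp16 extension here even though __ARM_NEON is defined on the Jetson:

#ifdef __ARM_NEON
    // ARM NEON targets use the compiler's native 16-bit float type
    typedef __fp16 ggml_fp16_t;     // around line 218: nvcc rejects __fp16 here
#else
    // other targets store fp16 values in a plain 16-bit integer
    typedef uint16_t ggml_fp16_t;
#endif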

malv-c · Jun 26 '23 13:06

I came here just for this: Exact same problem on AGX Orin JP 5.1.1 L4T 35.3.1

/usr/src/llama.cpp/ggml.h(218): error: identifier "__fp16" is undefined

manbehindthemadness · Jun 26 '23 15:06

Ahhhh, Cortex ARMv8+ processors no longer support NEON; the library must be built fully 64-bit. They can support 32-bit, but only when running within a 32-bit operating system / kernel.

manbehindthemadness · Jun 26 '23 16:06

@malv-c If you replace __fp16 with uint16_t on line 218 of ggml.h the project builds and cuBLAS works without issue.
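As a concrete sketch of that workaround (assuming the typedef sits inside the usual #ifdef __ARM_NEON block of that era's ggml.h; the exact surrounding lines may differ), the edit amounts to:

#ifdef __ARM_NEON
    // workaround for nvcc on Jetson: __fp16 is not accepted here, so store
    // half-precision values in a plain 16-bit integer container instead
    typedef uint16_t ggml_fp16_t;   // was: typedef __fp16 ggml_fp16_t;
#else
    typedef uint16_t ggml_fp16_t;
#endif

This is the bluntest form of the fix, since it also changes the type the host compiler sees; the guard sketched after the #1455 link below is more targeted.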

manbehindthemadness · Jun 26 '23 19:06

Even though this builds successfully, it does seem to be attempting to use NEON; I am unsure whether this will have a performance impact...

llama.cpp: loading model from /opt/gpt-models/vicuna-7b-1.1.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 1924.88 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 8234 MB
llama_new_context_with_model: kv self size  =  256.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
llama_print_timings:        load time =   593.09 ms

manbehindthemadness · Jun 26 '23 21:06

Does this thread help? https://github.com/ggerganov/llama.cpp/issues/1455

swittk · Jun 26 '23 23:06

Oh! This here looks like it might be the silver bullet: https://github.com/ggerganov/llama.cpp/issues/1455#issuecomment-1555761710
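I have not re-verified the exact patch in that linked comment, but the fix usually suggested for Jetson boards has this shape: keep the ARM __fp16 type for the normal host compiler and only fall back to a plain integer when nvcc is processing the header. A sketch, not the exact upstream code:

#if defined(__ARM_NEON) && !defined(__CUDACC__)
    // gcc/clang on AArch64 understand the __fp16 extension
    typedef __fp16 ggml_fp16_t;
#else
    // nvcc (and non-ARM targets) get a plain 16-bit container
    typedef uint16_t ggml_fp16_t;
#endif

That keeps the NEON fp16 path intact for CPU code while avoiding the "identifier __fp16 is undefined" error in the CUDA build.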

manbehindthemadness · Jun 26 '23 23:06

Thanks, Kevin

malv-c · Jun 27 '23 06:06

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Apr 10 '24 01:04