llama.cpp
Build failure on Jetson Orin
Both llama.cpp, configured with:
% cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_DMMV_F16=ON -DLLAMA_CUDA_DMMV_Y=16
and koboldcpp, configured with:
% cmake .. -DLLAMA_CUBLAS=1
fail with:
ggml.h(218): error: identifier "__fp16" is undefined
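For what it's worth, the error format (ggml.h(218): error: ...) looks like it comes from the CUDA toolchain's front end rather than from gcc, which would explain why plain CPU builds succeed and only the cuBLAS builds fail. Below is a minimal, hedged sketch of the construct involved; the file name and type name are made up for illustration, and in llama.cpp the real declaration is the ggml_fp16_t typedef in ggml.h.

```c
/* fp16_repro.c -- hypothetical minimal repro, not part of llama.cpp.
 * aarch64 gcc/clang accept the ARM __fp16 extension:
 *     gcc -c fp16_repro.c        -> compiles
 * but feeding the same declaration through nvcc (e.g. renamed to
 * fp16_repro.cu) reportedly fails the same way as ggml.h(218) above. */
#include <stdint.h>

#if defined(__ARM_NEON)
typedef __fp16 repro_fp16_t;    /* native half-precision storage type  */
#else
typedef uint16_t repro_fp16_t;  /* 16-bit integer used as raw storage  */
#endif

int main(void)
{
    repro_fp16_t x = (repro_fp16_t)1;
    (void)x;
    return 0;
}
```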
I came here just for this: exact same problem on AGX Orin, JetPack 5.1.1, L4T 35.3.1.
/usr/src/llama.cpp/ggml.h(218): error: identifier "__fp16" is undefined
Ahhhh, ARMv8+ Cortex processors no longer support the old 32-bit NEON path; the library must be fully 64-bit (AArch64). They can support 32-bit (AArch32), but only when running within a 32-bit operating system / kernel.
@malv-c If you replace __fp16 with uint16_t on line 218 of ggml.h the project builds and cuBLAS works without issue.
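For reference, a sketch of what that edit looks like, assuming the typedef block around ggml.h line 218 resembled the commented-out original below (exact surrounding lines may differ between revisions):

```c
/* ggml.h, around line 218 -- sketch of the workaround described above.
 * Original (approximate):
 *
 *   #ifdef __ARM_NEON
 *   // we use the built-in 16-bit float type
 *   typedef __fp16 ggml_fp16_t;
 *   #else
 *   typedef uint16_t ggml_fp16_t;
 *   #endif
 *
 * Edited so the CUDA front end never sees __fp16: */
#include <stdint.h>

typedef uint16_t ggml_fp16_t;

/* A more surgical (untested here) variant would keep the native half
 * type for non-CUDA translation units, e.g.:
 *
 *   #if defined(__ARM_NEON) && !defined(__CUDACC__)
 *   typedef __fp16 ggml_fp16_t;
 *   #else
 *   typedef uint16_t ggml_fp16_t;
 *   #endif
 */
```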
Even though this successfully builds, it does seem to be attempting to use NEON; I am unsure whether this will have a performance impact...
llama.cpp: loading model from /opt/gpt-models/vicuna-7b-1.1.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1924.88 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 8234 MB
llama_new_context_with_model: kv self size = 256.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
llama_print_timings: load time = 593.09 ms
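To double-check what the host compiler will actually enable, independently of the NEON = 1 banner above, here is a small standalone probe of the ARM feature macros that ggml keys its SIMD paths off (hypothetical file name, not part of llama.cpp):

```c
/* neon_check.c -- standalone sanity check of the ACLE feature macros
 * (a sketch). Build on the Jetson with:
 *     gcc -O2 neon_check.c -o neon_check && ./neon_check */
#include <stdio.h>

int main(void)
{
#if defined(__aarch64__)
    puts("__aarch64__ defined: 64-bit ARM target");
#endif
#if defined(__ARM_NEON)
    puts("__ARM_NEON defined: NEON/Advanced SIMD paths will be compiled");
#else
    puts("__ARM_NEON not defined");
#endif
#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    puts("__ARM_FEATURE_FP16_VECTOR_ARITHMETIC defined: native fp16 SIMD available");
#endif
    return 0;
}
```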
Does this thread help? https://github.com/ggerganov/llama.cpp/issues/1455
Oh! This here looks like it might be the silver bullet: https://github.com/ggerganov/llama.cpp/issues/1455#issuecomment-1555761710
thanks Kevin
This issue was closed because it has been inactive for 14 days since being marked as stale.