llamafile
Bug: Huge difference in prompt processing speed (tokens/sec) compared to llama.cpp or Ollama
What happened?
For llama.cpp, I downloaded the q4_k_m quantized model and benchmarked it with llama-bench.
For Ollama, I pulled the q4_k_m model from Ollama. Running the model with the --verbose flag, I manually recorded the prompt eval rate over 10 trials, using the same prompt of approximately 512 tokens each time.
For llamafile, I used the same model as for llama.cpp, packaged it as a llamafile, and benchmarked it with llamafile-bench.
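The manual recording for Ollama could also be scripted. The following is a minimal sketch, assuming the responses come from Ollama's `/api/generate` endpoint, which reports `prompt_eval_count` (tokens) and `prompt_eval_duration` (nanoseconds). The helper names and the sample values are hypothetical placeholders, not measurements from this issue.

```python
from statistics import mean, stdev


def prompt_eval_rate(response: dict) -> float:
    """Prompt processing speed in tokens/sec for one Ollama response."""
    tokens = response["prompt_eval_count"]
    seconds = response["prompt_eval_duration"] / 1e9  # field is in nanoseconds
    return tokens / seconds


def summarize(responses: list[dict]) -> tuple[float, float]:
    """Mean and sample standard deviation of the prompt eval rate."""
    rates = [prompt_eval_rate(r) for r in responses]
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread


# Placeholder values for illustration only, NOT data from this issue.
example_responses = [
    {"prompt_eval_count": 512, "prompt_eval_duration": 200_000_000},
    {"prompt_eval_count": 512, "prompt_eval_duration": 250_000_000},
]

avg, spread = summarize(example_responses)
print(f"prompt eval rate: {avg:.2f} ± {spread:.2f} tokens/s")
```

Reporting the spread alongside the mean makes the result easier to compare with the `t/s` column that llama-bench prints.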
llama-bench logs:
llama-bench -m "gguf/llama-3.2-1b-q4_k_m.gguf" -p 512 -n 1024 -ngl 17 --verbose
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
llama_model_loader: loaded meta data with 29 key-value pairs and 147 tensors from gguf/llama-3.2-1b-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 1B
llama_model_loader: - kv 3: general.basename str = Llama-3.2
llama_model_loader: - kv 4: general.size_label str = 1B
llama_model_loader: - kv 5: general.license str = llama3.2
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 16
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 2048
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 32
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 64
llama_model_loader: - kv 17: llama.attention.value_length u32 = 64
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 34 tensors
llama_model_loader: - type q4_K: 96 tensors
llama_model_loader: - type q6_K: 17 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 16
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 1.24 B
llm_load_print_meta: model size = 762.81 MiB (5.18 BPW)
llm_load_print_meta: general.name = Llama 3.2 1B
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128001 '<|end_of_text|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size = 762.83 MiB, ( 762.91 / 27648.00)
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 17/17 layers to GPU
llm_load_tensors: CPU buffer size = 205.49 MiB
llm_load_tensors: Metal buffer size = 762.82 MiB
.......................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 28991.03 MB
llama_kv_cache_init: Metal KV buffer size = 16.00 MiB
llama_new_context_with_model: KV self size = 16.00 MiB, K (f16): 8.00 MiB, V (f16): 8.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: Metal compute buffer size = 254.50 MiB
llama_new_context_with_model: CPU compute buffer size = 5.01 MiB
llama_new_context_with_model: graph nodes = 518
llama_new_context_with_model: graph splits = 2
| llama ?B Q4_K - Medium | 968.30 MiB | 1.50 B | Metal | 17 | pp512 | 3177.19 ± 12.58 |
llama_perf_context_print: load time = 503.11 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 3072 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 1309.45 ms / 3073 tokens
ggml_metal_free: deallocating
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 28991.03 MB
llama_kv_cache_init: Metal KV buffer size = 32.00 MiB
llama_new_context_with_model: KV self size = 32.00 MiB, K (f16): 16.00 MiB, V (f16): 16.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: Metal compute buffer size = 254.50 MiB
llama_new_context_with_model: CPU compute buffer size = 6.01 MiB
llama_new_context_with_model: graph nodes = 518
llama_new_context_with_model: graph splits = 2
| llama ?B Q4_K - Medium | 968.30 MiB | 1.50 B | Metal | 17 | tg1024 | 149.36 ± 0.47 |
llama_perf_context_print: load time = 1341.56 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 5121 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 35623.50 ms / 5122 tokens
ggml_metal_free: deallocating
build: 1b2f992c (3837)
ollama logs:
cat ~/.ollama/logs/server.log
2024/10/03 08:09:46 routes.go:1153: INFO server config env="map[HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/391080/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: http_proxy: https_proxy: no_proxy:]"
time=2024-10-03T08:09:46.111+05:30 level=INFO source=images.go:753 msg="total blobs: 18"
time=2024-10-03T08:09:46.113+05:30 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-10-03T08:09:46.114+05:30 level=INFO source=routes.go:1200 msg="Listening on 127.0.0.1:11434 (version 0.3.12)"
time=2024-10-03T08:09:46.115+05:30 level=INFO source=common.go:135 msg="extracting embedded files" dir=/var/folders/_y/gbtnmk4s65lfdzb3vw888hydbsxg_p/T/ollama2160509689/runners
time=2024-10-03T08:09:46.135+05:30 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners=[metal]
time=2024-10-03T08:09:46.201+05:30 level=INFO source=types.go:107 msg="inference compute" id=0 library=metal variant="" compute="" driver=0.0 name="" total="27.0 GiB" available="27.0 GiB"
[GIN] 2024/10/03 - 08:09:46 | 200 | 93.209µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/10/03 - 08:09:46 | 200 | 11.964959ms | 127.0.0.1 | POST "/api/show"
time=2024-10-03T08:09:46.631+05:30 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/391080/.ollama/models/blobs/sha256-d06ffdc00fd5175ccb2371c6686ba63ed30bc915253158721344924bf699401e gpu=0 parallel=4 available=28991029248 required="2.1 GiB"
time=2024-10-03T08:09:46.631+05:30 level=INFO source=server.go:103 msg="system memory" total="36.0 GiB" free="28.5 GiB" free_swap="0 B"
time=2024-10-03T08:09:46.631+05:30 level=INFO source=memory.go:326 msg="offload to metal" layers.requested=-1 layers.model=17 layers.offload=17 layers.split="" memory.available="[27.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.1 GiB" memory.required.partial="2.1 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[2.1 GiB]" memory.weights.total="813.3 MiB" memory.weights.repeating="607.8 MiB" memory.weights.nonrepeating="205.5 MiB" memory.graph.full="544.0 MiB" memory.graph.partial="544.0 MiB"
time=2024-10-03T08:09:46.633+05:30 level=INFO source=server.go:388 msg="starting llama server" cmd="/var/folders/_y/gbtnmk4s65lfdzb3vw888hydbsxg_p/T/ollama2160509689/runners/metal/ollama_llama_server --model /Users/391080/.ollama/models/blobs/sha256-d06ffdc00fd5175ccb2371c6686ba63ed30bc915253158721344924bf699401e --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 17 --parallel 4 --port 49469"
time=2024-10-03T08:09:46.642+05:30 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-10-03T08:09:46.642+05:30 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
time=2024-10-03T08:09:46.642+05:30 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3670 commit="194ef086" tid="0x1f435bac0" timestamp=1727923187
INFO [main] system info | n_threads=10 n_threads_batch=10 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="0x1f435bac0" timestamp=1727923187 total_threads=14
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="49469" tid="0x1f435bac0" timestamp=1727923187
time=2024-10-03T08:09:47.397+05:30 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 29 key-value pairs and 147 tensors from /Users/391080/.ollama/models/blobs/sha256-d06ffdc00fd5175ccb2371c6686ba63ed30bc915253158721344924bf699401e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 1B
llama_model_loader: - kv 3: general.basename str = Llama-3.2
llama_model_loader: - kv 4: general.size_label str = 1B
llama_model_loader: - kv 5: general.license str = llama3.2
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 16
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 2048
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 32
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 64
llama_model_loader: - kv 17: llama.attention.value_length u32 = 64
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 34 tensors
llama_model_loader: - type q4_K: 96 tensors
llama_model_loader: - type q6_K: 17 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 16
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 1.24 B
llm_load_print_meta: model size = 762.81 MiB (5.18 BPW)
llm_load_print_meta: general.name = Llama 3.2 1B
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size = 762.83 MiB, ( 762.91 / 27648.00)
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 17/17 layers to GPU
llm_load_tensors: CPU buffer size = 205.49 MiB
llm_load_tensors: Metal buffer size = 762.82 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 28991.03 MB
llama_kv_cache_init: Metal KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 1.99 MiB
llama_new_context_with_model: Metal compute buffer size = 544.00 MiB
llama_new_context_with_model: CPU compute buffer size = 20.01 MiB
llama_new_context_with_model: graph nodes = 518
llama_new_context_with_model: graph splits = 2
INFO [main] model loaded | tid="0x1f435bac0" timestamp=1727923188
time=2024-10-03T08:09:49.159+05:30 level=INFO source=server.go:626 msg="llama runner started in 2.52 seconds"
[GIN] 2024/10/03 - 08:09:49 | 200 | 2.555986541s | 127.0.0.1 | POST "/api/generate"
[GIN] 2024/10/03 - 08:10:20 | 200 | 7.493648584s | 127.0.0.1 | POST "/api/chat"
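The runs above do not use the same context size: llama-bench allocates n_ctx = 512 and 1024, while Ollama starts its server with --ctx-size 8192. The KV buffer sizes in the logs are consistent with that. The sketch below is a back-of-the-envelope check, assuming an f16 cache (2 bytes per element, as the logs state) and a size of n_ctx × n_layer × n_embd_k_gqa elements for K and likewise for V. The function name is hypothetical.

```python
MIB = 1024 * 1024


def kv_cache_mib(n_ctx: int, n_layer: int = 16, n_embd_k_gqa: int = 512,
                 n_embd_v_gqa: int = 512, bytes_per_elem: int = 2) -> tuple[float, float, float]:
    """Return (K, V, total) KV cache size in MiB.

    Defaults are the values printed by llm_load_print_meta for this model;
    bytes_per_elem = 2 assumes an f16 cache.
    """
    k_bytes = n_ctx * n_layer * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_ctx * n_layer * n_embd_v_gqa * bytes_per_elem
    return k_bytes / MIB, v_bytes / MIB, (k_bytes + v_bytes) / MIB


for n_ctx in (512, 1024, 8192):
    k, v, total = kv_cache_mib(n_ctx)
    print(f"n_ctx={n_ctx}: K={k:.2f} MiB, V={v:.2f} MiB, total={total:.2f} MiB")
```

Under these assumptions the totals come out to 16, 32, and 256 MiB, matching the "KV self size" lines reported for n_ctx = 512, 1024, and 8192.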
Version
~~llamafile v0.8.4~~ llamafile v0.8.13
What operating system are you seeing the problem on?
Mac
Relevant log output
llamafile/bin/llamafile-bench -m llamafiles/llama-3.2-1b-q4_k_m.llamafile -ngl 17 -n 1024 -p 512 --verbose
warning: don't know how to govern your cpu temperature; consider setting the environment variables described in llamafile/govern.cpp
| cpu_info | model_filename | size | test | t/s |
| ---------------------------: | ---------------------------------------: | ---------: | ------------: | --------------: |
llama_model_loader: loaded meta data with 29 key-value pairs and 147 tensors from llamafiles/llama-3.2-1b-q4_k_m.llamafile (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 1B
llama_model_loader: - kv 3: general.basename str = Llama-3.2
llama_model_loader: - kv 4: general.size_label str = 1B
llama_model_loader: - kv 5: general.license str = llama3.2
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 16
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 2048
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 32
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 64
llama_model_loader: - kv 17: llama.attention.value_length u32 = 64
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 34 tensors
llama_model_loader: - type q4_K: 96 tensors
llama_model_loader: - type q6_K: 17 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 16
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 1.24 B
llm_load_print_meta: model size = 762.81 MiB (5.18 BPW)
llm_load_print_meta: general.name = Llama 3.2 1B
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.08 MiB
llm_load_tensors: CPU buffer size = 762.81 MiB
.......................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 16.00 MiB
llama_new_context_with_model: KV self size = 16.00 MiB, K (f16): 8.00 MiB, V (f16): 8.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 254.50 MiB
llama_new_context_with_model: graph nodes = 518
llama_new_context_with_model: graph splits = 1
| Apple M3 Max (+fp16+dotprod) | llama-3.2-1b-q4_k_m | 968.30 MiB | pp512 | 686.27 |
llama_print_timings: load time = 812.26 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 2998.05 ms / 2048 tokens ( 1.46 ms per token, 683.11 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 3049.98 ms / 2049 tokens
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 32.00 MiB
llama_new_context_with_model: KV self size = 32.00 MiB, K (f16): 16.00 MiB, V (f16): 16.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 254.50 MiB
llama_new_context_with_model: graph nodes = 518
llama_new_context_with_model: graph splits = 1
| Apple M3 Max (+fp16+dotprod) | llama-3.2-1b-q4_k_m | 968.30 MiB | tg1024 | 114.46 |
llama_print_timings: load time = 3062.09 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( nan ms per token, nan tokens per second)
llama_print_timings: eval time = 26868.69 ms / 3073 runs ( 8.74 ms per token, 114.37 tokens per second)
llama_print_timings: total time = 29923.98 ms / 3073 tokens
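To put a number on the gap, the figures from the two benchmark tables above can be compared directly. This is a small sketch using only values copied from the logs; the dictionary layout and function names are illustrative.

```python
# t/s values copied from the llama-bench and llamafile-bench tables above.
results = {
    "llama-bench": {"pp512": 3177.19, "tg1024": 149.36},
    "llamafile-bench": {"pp512": 686.27, "tg1024": 114.46},
}


def speedup(test: str) -> float:
    """How many times faster llama-bench is than llamafile-bench on a test."""
    return results["llama-bench"][test] / results["llamafile-bench"][test]


def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Throughput from a llama_print_timings line (time in ms, token count)."""
    return n_tokens / (elapsed_ms / 1000.0)


for test in ("pp512", "tg1024"):
    print(f"{test}: llama-bench is {speedup(test):.2f}x faster")

# Cross-check the llamafile-bench prompt eval line: 2998.05 ms / 2048 tokens.
print(f"prompt eval: {tokens_per_second(2998.05, 2048):.2f} tokens/s")
```

The prompt processing gap (roughly 4.6x) is much larger than the token generation gap (roughly 1.3x).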
llamafile v0.8.4
This is an old one... can you benchmark with at least the latest published release, v0.8.13? (PS: no need to rebuild, just get it from the releases and pass it your model with -m zzz.gguf or -m zzz.llamafile.)
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.08 MiB
llm_load_tensors: CPU buffer size = 762.81 MiB
.......................................................
Also, in the llamafile case everything is computed on the CPU, not on the GPU (Metal). I don't know whether that old release can use the GPU.
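One way to confirm which backend a run actually used is to look at where the KV cache was allocated. The sketch below scans log text for the `llama_kv_cache_init` lines seen in the logs above. The function name is hypothetical, and relying on this one log line is an assumption that may not hold for other versions.

```python
import re

# Matches lines such as:
#   llama_kv_cache_init: Metal KV buffer size = 16.00 MiB
#   llama_kv_cache_init: CPU KV buffer size = 16.00 MiB
KV_BUFFER_RE = re.compile(r"llama_kv_cache_init:\s+(\w+) KV buffer size")


def kv_backends(log_text: str) -> set[str]:
    """Return the set of backends that hold the KV cache in a log."""
    return set(KV_BUFFER_RE.findall(log_text))


llama_bench_log = "llama_kv_cache_init: Metal KV buffer size = 16.00 MiB"
llamafile_bench_log = "llama_kv_cache_init: CPU KV buffer size = 16.00 MiB"

print(kv_backends(llama_bench_log))      # backend names found in the llama-bench excerpt
print(kv_backends(llamafile_bench_log))  # backend names found in the llamafile-bench excerpt
```

Applied to the excerpts above, the llama-bench log reports a Metal KV buffer while the llamafile-bench log reports a CPU one, even though both commands passed -ngl 17.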
@Djip007 My bad. I ran the above tests with the latest release, v0.8.13, but while filling in the GitHub issue I left the version from the default issue template. I apologise for the error. I have corrected it now.
Regarding the point that everything is computed on the CPU in the llamafile case: I set the number of GPU layers to offload to 17:
llamafile/bin/llamafile-bench -m llamafiles/llama-3.2-1b-q4_k_m.llamafile -ngl 17 -n 1024 -p 512 --verbose
Does this mean -ngl is not working as expected?
Oh yes, if I am right, llamafile-bench only benchmarks the CPU for now. (llamafile itself does support the GPU, though maybe with some "bug" in v0.8.13: https://github.com/Mozilla-Ocho/llamafile/pull/534)
llamafile-bench currently only supports the CPU. I can put up a branch that will enable GPU support tomorrow.
The fix in #534 should resolve the issue with GPU performance being slower than llama.cpp.
That would be nice!
#581 will address GPU support in llamafile-bench.
I am able to get correct results now. Thanks!