
bug: GPU barely used

Open Hubert21 opened this issue 1 month ago • 3 comments

Version: 0.7.1

Describe the Bug

My GPU is barely used; instead, the CPU runs at 100%, even with a small model that does not take all the VRAM.

Steps to Reproduce

  1. Run any model

Screenshots / Logs

```
[2:26:34 PM] INFO modelSize: 4920734624
[2:26:34 PM] DEBUG starting new connection: http://localhost:3669/
[2:26:34 PM] DEBUG starting new connection: http://localhost:3669/
[2:26:34 PM] INFO [llamacpp] srv log_server_r: request: GET /health 127.0.0.1 200
[2:26:34 PM] DEBUG starting new connection: http://localhost:3669/
[2:26:35 PM] INFO [llamacpp] srv log_server_r: request: GET /health 127.0.0.1 200
[2:26:35 PM] DEBUG starting new connection: http://localhost:3669/
[2:26:35 PM] INFO [llamacpp] srv params_from_: Chat format: Content-only
[2:26:35 PM] INFO [llamacpp] slot get_availabl: id 0 | task 0 | selected slot by lcs similarity, lcs_len = 277, similarity = 0.249 (> 0.100 thold)
[2:26:35 PM] INFO [llamacpp] slot launch_slot_: id 0 | task 839 | processing task
[2:26:35 PM] INFO [llamacpp] slot update_slots: id 0 | task 839 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 277
[2:26:35 PM] INFO [llamacpp] slot update_slots: id 0 | task 839 | need to evaluate at least 1 token for each active slot, n_past = 277, n_prompt_tokens = 277
[2:26:35 PM] INFO [llamacpp] slot update_slots: id 0 | task 839 | kv cache rm [276, end)
[2:26:35 PM] INFO [llamacpp] slot update_slots: id 0 | task 839 | prompt processing progress, n_past = 277, n_tokens = 1, progress = 0.003610
[2:26:35 PM] INFO [llamacpp] slot update_slots: id 0 | task 839 | prompt done, n_past = 277, n_tokens = 1
[2:26:35 PM] INFO [llamacpp] srv log_server_r: request: POST /apply-template 127.0.0.1 200
[2:26:35 PM] DEBUG starting new connection: http://localhost:3669/
[2:26:35 PM] INFO [llamacpp] srv log_server_r: request: POST /tokenize 127.0.0.1 200
[2:26:36 PM] INFO Using ctx_size: 8192
[2:26:36 PM] INFO Received ctx_size parameter: Some(8192)
[2:26:36 PM] INFO Received model metadata: [4:28:19 PM] INFO {"llama.rope.dimension_count": "128", "general.type": "model", "general.basename": "models-meta-llama-Meta-Llama-3.1", "llama.embedding_length": "4096", "tokenizer.chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}", "general.file_type": "15", "llama.feed_forward_length": "14336", "general.architecture": "llama", "quantize.imatrix.entries_count": "224", "tokenizer.ggml.token_type": "<Array of type Int32 with 128256 elements, data skipped>", "general.languages": "[en, de, fr, it, pt, hi, es, th]", "quantize.imatrix.chunks_count": "68", "tokenizer.ggml.tokens": "<Array of type String with 128256 elements, data skipped>", "general.finetune": "Instruct", "quantize.imatrix.dataset": "group_40.txt", "general.license": "llama3.1", "llama.attention.head_count_kv": "8", "llama.vocab_size": "128256", "llama.context_length": "131072", "llama.rope.freq_base": "500000", "tokenizer.ggml.bos_token_id": "128000", "general.name": "Models Meta Llama Meta Llama 3.1 8B Instruct", "tokenizer.ggml.pre": "smaug-bpe", "general.size_label": "8B", "llama.block_count": "32", "llama.attention.layer_norm_rms_epsilon": "0.00001", "tokenizer.ggml.merges": "<Array of type String with 280147 elements, data skipped>", "quantize.imatrix.file": "./Meta-Llama-3.1-8B-Instruct-GGUF_imatrix.dat", "llama.attention.head_count": "32", "tokenizer.ggml.model": "gpt2", "tokenizer.ggml.eos_token_id": "128009", "general.quantization_version": "2", "general.tags": "[facebook, meta, pytorch, llama, llama-3, text-generation]"}
[2:26:36 PM] INFO Calculated key_len and val_len from embedding_length: 4096 / 32 heads = 128 per head
[2:26:36 PM] INFO KV estimate (no SWA detected) -> full: 1073741824 bytes (~1024.00 MB)
[2:26:36 PM] INFO isModelSupported: Total memory requirement: 5994476448 for \?\G:\data_de_jan\llamacpp\models\Meta-Llama-3_1-8B-Instruct_Q4_K_M\model.gguf; Got kvCacheSize: 1073741824 from BE
[2:26:36 PM] INFO Total VRAM reported/calculated (in bytes): 6442450944
[2:26:36 PM] INFO System RAM: 17154703360 bytes
[2:26:36 PM] INFO Total VRAM: 6442450944 bytes
[2:26:36 PM] INFO Usable total memory: 19020173926 bytes
[2:26:36 PM] INFO Usable VRAM: 4153960755 bytes
[2:26:36 PM] INFO Required: 5994476448 bytes
[2:27:09 PM] DEBUG Asset logs not found; fallback to logs.html
[2:27:09 PM] DEBUG Asset logs not found; fallback to logs/index.html
[2:27:09 PM] DEBUG Asset logs not found; fallback to index.html
[2:27:10 PM] INFO get jan extensions, path: "G:\data_de_jan\extensions\extensions.json"
[2:27:40 PM] DEBUG Asset logs not found; fallback to logs.html
[2:27:40 PM] DEBUG Asset logs not found; fallback to logs/index.html
[2:27:40 PM] DEBUG Asset logs not found; fallback to index.html
[2:27:40 PM] INFO get jan extensions, path: "G:\data_de_jan\extensions\extensions.json"
[2:27:51 PM] INFO [llamacpp] slot release: id 0 | task 839 | stop processing: n_past = 2007, truncated = 0
[2:27:51 PM] INFO [llamacpp] slot print_timing: id 0 | task 839 |
[2:27:51 PM] INFO [llamacpp] prompt eval time = 294.30 ms / 1 tokens (294.30 ms per token, 3.40 tokens per second)
[2:27:51 PM] INFO [llamacpp] eval time = 76150.92 ms / 1731 tokens (43.99 ms per token, 22.73 tokens per second)
[2:27:51 PM] INFO [llamacpp] total time = 76445.22 ms / 1732 tokens
[2:27:51 PM] INFO [llamacpp] srv update_slots: all slots are idle
[2:27:51 PM] INFO [llamacpp] srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
```
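The memory figures in the log line up with the printed model metadata, and they show the model failing the VRAM check. Below is a minimal sketch of that arithmetic (variable names are illustrative, not Jan's actual code; the f16 KV-cache assumption is mine, inferred from the numbers):

```python
# Reproduce the memory math from the log using the model metadata Jan prints
# (Meta-Llama-3.1-8B-Instruct, Q4_K_M).

n_ctx = 8192                  # "Using ctx_size: 8192"
block_count = 32              # llama.block_count
head_count_kv = 8             # llama.attention.head_count_kv
embedding_length = 4096       # llama.embedding_length
head_count = 32               # llama.attention.head_count
head_dim = embedding_length // head_count   # 128, matches "128 per head"
bytes_per_elem = 2            # assumes f16 K and V cache

# KV cache: K + V tensors, per layer, per KV head, per context token
kv_cache = 2 * n_ctx * block_count * head_count_kv * head_dim * bytes_per_elem
print(kv_cache)               # 1073741824, matches "KV estimate ... full"

model_size = 4920734624       # "modelSize" from the log
required = model_size + kv_cache
usable_vram = 4153960755      # "Usable VRAM" from the log

print(required)               # 5994476448, matches "Required"
print(required <= usable_vram)  # False: the model does not fit in usable VRAM
```

Since the required ~5.99 GB exceeds the ~4.15 GB of usable VRAM (only ~6 GB total VRAM is reported), the loader may be keeping some or all layers on the CPU, which would be consistent with the 100% CPU load and ~23 tokens/s observed.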

Operating System

  • [ ] MacOS
  • [x] Windows
  • [ ] Linux

Hubert21, Oct 13 '25 14:10