llama.cpp
Can not offload layers to GPU (llama3)
make LLAMA_CUDA=1 ./main -m /models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf -r '<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi! How are you?<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n" -n 1024 -ngl 90
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H800, compute capability 9.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 5459.93 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 9.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 356
Could this be from newlines in your shell? You might be running ./main -m /models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf -r '<|eot_id|>' and then separately trying to run --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" and so on.
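If that is what happened, a minimal sketch of the intended invocation, assuming the goal is to build with CUDA first and then issue the whole ./main call as one command (flags and values reused from the original post; adjust paths for your setup):

# build with CUDA support as a separate step
make LLAMA_CUDA=1

# run ./main as a single command; backslashes keep the continuation lines
# part of the same command so -ngl 90 is actually passed to ./main
./main -m /models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf \
  -r '<|eot_id|>' \
  --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" \
  --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" \
  -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi! How are you?<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n" \
  -n 1024 -ngl 90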
What is the equivalent of -r '<|eot_id|>' in llama server mode?
I always get <|eot_id|> in the response.
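A hedged sketch of how stop strings are typically supplied in server mode, assuming the server's /completion endpoint accepts a "stop" array in the JSON request body (field names and the default port may differ between versions, so check the server README for your build):

# assumption: server listening on the default localhost:8080 and a "stop"
# field in the /completion request body playing the same role as -r on the CLI
curl http://localhost:8080/completion -d '{
  "prompt": "Hi! How are you?",
  "n_predict": 1024,
  "stop": ["<|eot_id|>"]
}'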
This issue was closed because it has been inactive for 14 days since being marked as stale.