
Cannot offload layers to GPU (llama3)

realcarlos opened this issue 1 year ago · 2 comments

make LLAMA_CUDA=1
./main -m /models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf -r '<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi! How are you?<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n" -n 1024 -ngl 90

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H800, compute capability 9.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 5459.93 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 9.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 356

realcarlos · Apr 20 '24 14:04

Could this be from newlines in your shell? If the command was pasted with literal line breaks, the shell runs ./main -m /models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf -r '<|eot_id|>' and then separately tries to run --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" and so on, meaning -ngl 90 on the last line never reaches ./main. That would explain why 0/33 layers were offloaded.
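A minimal sketch of the fix, assuming a bash-like shell: run the CUDA build as its own step, then join the long invocation with trailing backslashes so every flag (including -ngl 90) reaches ./main as a single command.

# Build with CUDA support first; LLAMA_CUDA=1 is a make variable, not a ./main flag.
make LLAMA_CUDA=1

# One logical command: each trailing backslash continues the line,
# so nothing after a line break is silently dropped by the shell.
./main -m /models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf \
  -r '<|eot_id|>' \
  --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" \
  --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" \
  -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi! How are you?<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n" \
  -n 1024 -ngl 90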

Engininja2 · Apr 20 '24 17:04

> Could this be from newlines in your shell? If the command was pasted with literal line breaks, the shell runs ./main -m /models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf -r '<|eot_id|>' and then separately tries to run --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" and so on.

What is the corresponding -r '<|eot_id|>' option in llama server mode?

I always get <|eot_id|> in the response.
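For reference, a sketch of the server-side equivalent: the llama.cpp server's /completion endpoint accepts a stop array of stopping strings, which plays the same role as -r in ./main. The host and port below are assumptions for illustration.

# Assumes a llama.cpp server already running on localhost:8080 (adjust as needed).
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHi! How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "n_predict": 128,
        "stop": ["<|eot_id|>"]
      }'

With a stop string set, generation halts before the token is emitted, so <|eot_id|> should no longer appear in the response text.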

bash99 · Apr 22 '24 11:04

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Jun 06 '24 01:06