llama.cpp
Can not offload layers to GPU (llama3)
make LLAMA_CUDA=1 ./main -m /models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf -r '<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi! How are you?<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n" -n 1024 -ngl 90
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H800, compute capability 9.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 5459.93 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 9.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 356
Could this be from newlines in your shell? You might be running ./main -m /models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf -r '<|eot_id|>' and then separately trying to run --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" and so on.
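If that is what happened, a minimal sketch of the intended invocation, assuming the goal is to build with CUDA first and then issue the whole ./main call as one command (flags and values reused from the original post; adjust paths for your setup):

# build with CUDA support as a separate step
make LLAMA_CUDA=1

# run ./main as a single command; backslashes keep the continuation lines
# part of the same command so -ngl 90 is actually passed to ./main
./main -m /models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf \
  -r '<|eot_id|>' \
  --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" \
  --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" \
  -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi! How are you?<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n" \
  -n 1024 -ngl 90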
What is the equivalent of -r '<|eot_id|>' in llama server mode?
I always get <|eot_id|> in the response.
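A hedged sketch of how stop strings are typically supplied in server mode, assuming the server's /completion endpoint accepts a "stop" array in the JSON request body (field names and the default port may differ between versions, so check the server README for your build):

# assumption: server listening on the default localhost:8080 and a "stop"
# field in the /completion request body playing the same role as -r on the CLI
curl http://localhost:8080/completion -d '{
  "prompt": "Hi! How are you?",
  "n_predict": 1024,
  "stop": ["<|eot_id|>"]
}'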
This issue was closed because it has been inactive for 14 days since being marked as stale.