llama-cpp-python not using GPU on Google Colab
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
Expected the model to load on the T4 GPU on Colab.
CUDA version: 12.2
Install command:
!pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122 --verbose
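A quick way to check whether the installed package was actually built with GPU support (a minimal sketch; it assumes a recent llama-cpp-python version that exposes the low-level llama_supports_gpu_offload helper):
# Sketch: returns True only if the wheel was compiled with a GPU backend (e.g. CUDA).
from llama_cpp import llama_supports_gpu_offload
print(llama_supports_gpu_offload())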
Current Behavior
Zero GPU usage; the model is loaded entirely into CPU buffers (note the "CPU buffer size" lines and the absence of any CUDA buffers in the log below).
llama_model_loader: loaded meta data with 33 key-value pairs and 291 tensors from /root/.cache/huggingface/hub/models--AnirudhJM24--Llama3-OpenBioLLM-8B-Q4_K_M-GGUF/snapshots/8f01788085a3ac57ddb617392855d6188514b974/llama3-openbiollm-8b-q4_k_m.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Meta Llama 3 8B llama_model_loader: - kv 3: general.organization str = Meta Llama llama_model_loader: - kv 4: general.basename str = Meta-Llama-3 llama_model_loader: - kv 5: general.size_label str = 8B llama_model_loader: - kv 6: general.license str = llama3 llama_model_loader: - kv 7: general.base_model.count u32 = 1 llama_model_loader: - kv 8: general.base_model.0.name str = Meta Llama 3 8B llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Met... llama_model_loader: - kv 11: general.tags arr[str,10] = ["llama-3", "llama", "Mixtral", "inst... llama_model_loader: - kv 12: general.languages arr[str,1] = ["en"] llama_model_loader: - kv 13: llama.block_count u32 = 32 llama_model_loader: - kv 14: llama.context_length u32 = 8192 llama_model_loader: - kv 15: llama.embedding_length u32 = 4096 llama_model_loader: - kv 16: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 17: llama.attention.head_count u32 = 32 llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 21: general.file_type u32 = 15 llama_model_loader: - kv 22: llama.vocab_size u32 = 128256 llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 25: tokenizer.ggml.pre str = smaug-bpe llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 128001 llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 128001 llama_model_loader: - kv 32: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_K: 193 tensors llama_model_loader: - type q6_K: 33 tensors llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.8000 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.58 GiB (4.89 BPW) llm_load_print_meta: general.name = Meta Llama 3 8B llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128001 '<|end_of_text|>' llm_load_print_meta: PAD token = 128001 '<|end_of_text|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: EOG token = 128001 '<|end_of_text|>' llm_load_print_meta: EOG token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.14 MiB llm_load_tensors: CPU buffer size = 4685.30 MiB ........................................................................................ 
llama_new_context_with_model: n_ctx = 512 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 64.00 MiB llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB llama_new_context_with_model: CPU output buffer size = 0.49 MiB llama_new_context_with_model: CPU compute buffer size = 258.50 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 1 AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | Model metadata: {'tokenizer.ggml.eos_token_id': '128001', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'llama.vocab_size': '128256', 'general.file_type': '15', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.freq_base': '500000.000000', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.feed_forward_length': '14336', 'general.architecture': 'llama', 'llama.attention.head_count_kv': '8', 'llama.block_count': '32', 'tokenizer.ggml.padding_token_id': '128001', 'general.basename': 'Meta-Llama-3', 'llama.embedding_length': '4096', 'general.base_model.0.organization': 'Meta Llama', 'tokenizer.ggml.pre': 'smaug-bpe', 'llama.context_length': '8192', 'general.name': 'Meta Llama 3 8B', 'llama.rope.dimension_count': '128', 'general.base_model.0.name': 'Meta Llama 3 8B', 'general.organization': 'Meta Llama', 'general.type': 'model', 'general.size_label': '8B', 'general.base_model.0.repo_url': 'https://huggingface.co/meta-llama/Meta-Llama-3-8B', 'general.license': 'llama3', 'general.base_model.count': '1'}
Environment and Context
Google Colab
Here's an example that does work on a Google Colab T4 instance:
%pip install --quiet https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.90-cu124/llama_cpp_python-0.2.90-cp310-cp310-linux_x86_64.whl
from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="*Q4_K_M.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,
    verbose=True
)
llm("Q: Name the planets in the solar system? A: ")
!pip install huggingface-hub fsspec==2023.6.0
!pip install --quiet https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.90-cu122/llama_cpp_python-0.2.90-cp310-cp310-linux_x86_64.whl
from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",
    filename="*q8_0.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,
    verbose=True
)
llm("Q: Name the planets in the solar system? A: ")
Requirement already satisfied: huggingface-hub in /usr/local/lib/python3.10/dist-packages (0.24.7) Requirement already satisfied: fsspec==2023.6.0 in /usr/local/lib/python3.10/dist-packages (2023.6.0) Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub) (3.16.1) Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub) (24.1) Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub) (6.0.2) Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from huggingface-hub) (2.32.3) Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub) (4.66.6) Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub) (4.12.2) Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub) (3.4.0) Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub) (3.10) Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub) (2.2.3) Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub) (2024.8.30) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 443.8/443.8 MB 1.4 MB/s eta 0:00:00 llama_model_loader: loaded meta data with 26 key-value pairs and 290 tensors from /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct-GGUF/snapshots/198f08841147e5196a6a69bd0053690fb1fd3857/./qwen2-0_5b-instruct-q8_0.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.name str = qwen2-0_5b-instruct llama_model_loader: - kv 2: qwen2.block_count u32 = 24 llama_model_loader: - kv 3: qwen2.context_length u32 = 32768 llama_model_loader: - kv 4: qwen2.embedding_length u32 = 896 llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 4864 llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 14 llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 2 llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 10: general.file_type u32 = 7 llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo... 
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - kv 22: quantize.imatrix.file str = ../Qwen2/gguf/qwen2-0_5b-imatrix/imat... llama_model_loader: - kv 23: quantize.imatrix.dataset str = ../sft_2406.txt llama_model_loader: - kv 24: quantize.imatrix.entries_count i32 = 168 llama_model_loader: - kv 25: quantize.imatrix.chunks_count i32 = 1937 llama_model_loader: - type f32: 121 tensors llama_model_loader: - type q8_0: 169 tensors llm_load_vocab: special tokens cache size = 293 llm_load_vocab: token to piece cache size = 0.9338 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 896 llm_load_print_meta: n_layer = 24 llm_load_print_meta: n_head = 14 llm_load_print_meta: n_head_kv = 2 llm_load_print_meta: n_rot = 64 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 64 llm_load_print_meta: n_embd_head_v = 64 llm_load_print_meta: n_gqa = 7 llm_load_print_meta: n_embd_k_gqa = 128 llm_load_print_meta: n_embd_v_gqa = 128 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 4864 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 1B llm_load_print_meta: model ftype = Q8_0 llm_load_print_meta: model params = 494.03 M llm_load_print_meta: model size = 500.79 MiB (8.50 BPW) llm_load_print_meta: general.name = qwen2-0_5b-instruct llm_load_print_meta: BOS token = 151643 '<|endoftext|>' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151643 '<|endoftext|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.25 MiB llm_load_tensors: offloading 24 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 25/25 layers to GPU llm_load_tensors: CPU buffer size = 137.94 MiB llm_load_tensors: CUDA0 buffer size = 500.84 MiB ........................................................... 
llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 96.00 MiB llama_new_context_with_model: KV self size = 96.00 MiB, K (f16): 48.00 MiB, V (f16): 48.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB llama_new_context_with_model: CUDA0 compute buffer size = 298.50 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 17.76 MiB llama_new_context_with_model: graph nodes = 846 llama_new_context_with_model: graph splits = 2 AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | Model metadata: {'quantize.imatrix.entries_count': '168', 'quantize.imatrix.dataset': '../sft_2406.txt', 'quantize.imatrix.chunks_count': '1937', 'quantize.imatrix.file': '../Qwen2/gguf/qwen2-0_5b-imatrix/imatrix.dat', 'tokenizer.ggml.add_bos_token': 'false', 'tokenizer.ggml.bos_token_id': '151643', 'general.architecture': 'qwen2', 'qwen2.block_count': '24', 'qwen2.context_length': '32768', 'tokenizer.chat_template': "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}", 'qwen2.attention.head_count_kv': '2', 'tokenizer.ggml.padding_token_id': '151643', 'qwen2.embedding_length': '896', 'qwen2.attention.layer_norm_rms_epsilon': '0.000001', 'qwen2.attention.head_count': '14', 'tokenizer.ggml.eos_token_id': '151645', 'qwen2.rope.freq_base': '1000000.000000', 'general.file_type': '7', 'general.quantization_version': '2', 'qwen2.feed_forward_length': '4864', 'tokenizer.ggml.model': 'gpt2', 'general.name': 'qwen2-0_5b-instruct', 'tokenizer.ggml.pre': 'qwen2'} Available chat formats from metadata: chat_template.default Using gguf chat template: {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system You are a helpful assistant.<|im_end|> ' }}{% endif %}{{'<|im_start|>' + message['role'] + ' ' + message['content'] + '<|im_end|>' + ' '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant ' }}{% endif %} Using chat eos_token: <|im_end|> Using chat bos_token: <|endoftext|>
llama_print_timings: load time = 20.02 ms llama_print_timings: sample time = 2.58 ms / 16 runs ( 0.16 ms per token, 6194.35 tokens per second) llama_print_timings: prompt eval time = 19.89 ms / 13 tokens ( 1.53 ms per token, 653.76 tokens per second) llama_print_timings: eval time = 141.56 ms / 15 runs ( 9.44 ms per token, 105.96 tokens per second) llama_print_timings: total time = 185.27 ms / 28 tokens {'id': 'cmpl-84a4335e-e465-4af4-a004-5c81d352fab5', 'object': 'text_completion', 'created': 1730768059, 'model': '/root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct-GGUF/snapshots/198f08841147e5196a6a69bd0053690fb1fd3857/./qwen2-0_5b-instruct-q8_0.gguf', 'choices': [{'text': '5. Mercury, Venus, Earth, Mars, Jupiter, Saturn. Question:', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 13, 'completion_tokens': 16, 'total_tokens': 29}}
The key lines showing the layers being offloaded to the GPU:
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors: CPU buffer size = 137.94 MiB
llm_load_tensors: CUDA0 buffer size = 500.84 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 96.00 MiB
llama_new_context_with_model: KV self size = 96.00 MiB, K (f16): 48.00 MiB, V (f16): 48.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 298.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 17.76 MiB
Google Colab has updated to CUDA 12.5, so none of the prebuilt wheels work anymore as far as I can see, and I haven't been able to figure out how to build it manually. Can anyone let me know how to do it? Thanks.
Installation in Colab works as before (version 0.3.4):
!pip install https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu124/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl
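Note that the cp311 tag in the wheel name has to match the runtime's Python version; a quick check (Colab currently ships Python 3.11, as mentioned further down in this thread):
import sys
# The cp311 wheel only installs on Python 3.11.x runtimes.
print(sys.version)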
If you need the latest version, you can build the wheel and then save it.
Building the latest version of llama-cpp-python with CUDA support into the llamacpp_wheel directory:
!CMAKE_ARGS="-DGGML_CUDA=on" pip wheel --no-deps --wheel-dir=llamacpp_wheel llama-cpp-python
The build takes about 30-40 minutes, and the GPU must be enabled in Colab.
The .whl file will be available in the llamacpp_wheel directory.
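Once the build finishes, the wheel can be installed directly from that directory (the exact filename depends on the version that gets built, so the glob below is just a convenience):
!pip install llamacpp_wheel/llama_cpp_python-*.whl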
(Optional) Saving the .whl file to Google Drive for convenience (after mounting the drive)
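If Drive is not mounted yet, the standard Colab helper can be used (this assumes the usual google.colab environment):
from google.colab import drive
# Mount Google Drive at /content/drive so the wheel can be copied there.
drive.mount('/content/drive')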
import shutil
src_wheel_file = 'llamacpp_wheel/llama_cpp_python-0.3.8-cp311-cp311-linux_x86_64.whl'
trg_wheel_file = '/content/drive/MyDrive/llama_cpp_python-0.3.8-cp311-cp311-linux_x86_64.whl'
shutil.copyfile(src_wheel_file, trg_wheel_file)
Later installation from the saved wheel
!pip install /content/drive/MyDrive/llama_cpp_python-0.3.8-cp311-cp311-linux_x86_64.whl
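A quick import check after installing from the saved wheel (the version string below is illustrative, and the GPU-offload helper is assumed to be exposed in this version):
import llama_cpp
print(llama_cpp.__version__)                    # e.g. 0.3.8
print(llama_cpp.llama_supports_gpu_offload())   # should be True for a CUDA build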
Okay it looks like the 12.4 wheel does work on the 12.5 environment. Hopefully this will remain true when Google decides to update their CUDA version again.
@Ado012 Could you please share the commands you ran, or even better, your Colab notebook.
It just doesn't work for me. I tried !pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124 and also with whl/cu125.
I also tried !pip install https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu124/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl
None of them utilize the GPU. I set n_gpu_layers to -1.
Found the issue. (Running on Google Colab, Python 3.11.12.)
The pre-built wheels command in the docs (using --extra-index-url) installs the source distribution (.tar.gz) instead of the actual .whl file, hence no GPU support.
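One way to make that fallback visible (a sketch using the standard pip --only-binary flag) is to forbid source builds, so pip fails loudly instead of silently compiling a CPU-only package:
!pip install llama-cpp-python --only-binary=llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124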
This direct install command works and uses the GPU: !pip install https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu124/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl
However, it doesn't support newer architectures like Gemma 3:
llama_model_load: error loading model: error loading model architecture: 'gemma3'
llama_load_model_from_file: failed to load model
It's true. To this day, llama.cpp cannot run the Gemma 3 architecture. I could make it run, but the answers were corrupted and not useful at all.