Unable to use Llamacpp
When I run the following code on my M1 MacBook Pro (macOS 14.4.1 (23E224)):
```python
from distilabel.pipeline import Pipeline
from distilabel.llms.llamacpp import LlamaCppLLM
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from datasets import load_dataset

datasts = load_dataset(
    "/ml/datasets/10k_prompts_ranked",
    split="train"
).filter(
    lambda r: r['avg_rating'] >= 4 and r['num_responses'] >= 2
)
datastsLst = datasts.to_list()

with Pipeline(name="text-generation-pipeline") as pipeline:
    load_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=datastsLst[0:10],
        output_mappings={"prompt": "instruction"},
    )
    text_generation = TextGeneration(
        name="text_generation",
        llm=LlamaCppLLM(
            model_path="/ml/models/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf",
            n_gpu_layers=-1,
        ),
    )
    load_dataset.connect(text_generation)

if __name__ == "__main__":
    pipeline.run(
        parameters={
            "text_generation": {
                "llm": {"generation_kwargs": {"temperature": 0.9, "n_ctx": 2048}}
            }
        }
    )
```
I am getting the following error:

```
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /ml/models/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct-imatrix
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 19: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct-imatrix
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.30 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 7605.34 MiB, ( 7605.41 / 10922.67)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 532.31 MiB
llm_load_tensors: Metal buffer size = 7605.33 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)" UserInfo={NSLocalizedDescription=Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)}
llama_new_context_with_model: failed to initialize Metal backend
[04/27/24 21:34:51] ERROR ['distilabel.pipeline.local'] ❌ Failed to load step 'genWithMetLlama3': local.py:489
Failed to create llama_context
[04/27/24 21:34:52] ERROR ['distilabel.pipeline.local'] ❌ Failed to load all the steps local.py:389
```

I had installed llama-cpp-python with:

```bash
CMAKE_ARGS="-DLLAMA_METAL_EMBED_LIBRARY=ON -DLLAMA_METAL=on" pip3 install -U --force-reinstall llama-cpp-python --no-cache-dir
```
But the following works:
```python
import os

from distilabel.llms.llamacpp import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks.text_generation import TextGeneration

modelP = "/ml/models/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf"

lmcpp = LlamaCppLLM(model_path=modelP, n_gpu_layers=-1, verbose=True)

text_generation = TextGeneration(
    name="text-generation",
    llm=lmcpp,
    input_batch_size=8,
    pipeline=Pipeline(name="sample-text-pipeline"),
)
text_generation.load()
```
Upgrading Python from 3.10 to 3.11 did not help either.
The code seems to work if I delete my conda environment and start fresh, but only when strictly following this order of pip installs:

- install llama-cpp-python
- install distilabel

If I reverse the order, the error crops up again.
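For reference, a rough sketch of the fresh-environment workaround described above; the environment name and Python version are illustrative, not taken from the original report:

```bash
# Start from a clean conda environment (name and Python version are placeholders)
conda create -n distilabel-llamacpp python=3.11 -y
conda activate distilabel-llamacpp

# Install llama-cpp-python first (with the Metal flags), then distilabel
CMAKE_ARGS="-DLLAMA_METAL_EMBED_LIBRARY=ON -DLLAMA_METAL=on" pip3 install llama-cpp-python --no-cache-dir
pip3 install distilabel
```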
Hi here @amritsingh183, thanks for reporting this issue! We'll investigate this: we did see problems when installing flash-attn before vllm, for example, but we didn't see problems with llama-cpp-python. We'll try to reproduce it, even though we don't use conda, so we're not sure whether this is related to the dependency installation order or is just a conda issue. If it's the former, we'll try to fix it; otherwise I'm afraid we won't be able to, but in that case adding a small section in the docs for conda users may do the trick.
Hi @amritsingh183, do you have any update on this? Internally we tried to reproduce it: I was unable to, while both @gabrielmbmb and @plaguss could indeed reproduce it and hit similar issues with the Metal backend. But AFAIK, since I'm able to install it and we're using the same M1/M2 chips, I'm afraid this only has to do with the llama-cpp-python installation. Could you please confirm that you can indeed use llama-cpp-python without distilabel, i.e. by running the example at https://github.com/abetlen/llama-cpp-python/blob/main/examples/high_level_api/high_level_api_inference.py with n_gpu_layers set to -1 so that the Metal backend is used?
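Something along these lines should be enough to check; this is only a sketch, reusing the model path from your report with a placeholder prompt:

```python
from llama_cpp import Llama

# Load the same GGUF file directly through llama-cpp-python,
# offloading all layers so the Metal backend is exercised.
llm = Llama(
    model_path="/ml/models/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf",
    n_gpu_layers=-1,
    verbose=True,
)

# A trivial completion: if this runs, the Metal backend initialized fine.
output = llm(
    "Q: Name the planets in the solar system. A:",
    max_tokens=64,
    stop=["Q:", "\n"],
)
print(output["choices"][0]["text"])
```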
Otherwise, I would recommend double-checking the installation instructions and re-installing, in case that fixes the issue: https://github.com/abetlen/llama-cpp-python/tree/main?tab=readme-ov-file#installation-configuration
Thanks!
Hey @alvarobartt, removing everything (deleting the conda env) and starting fresh seems to have solved the issue. Maybe there was some problem (a bad version of llama-cpp-python got installed) that was resolved by the re-install. Thanks for looking into this.