Unable to use Llamacpp

Open amritsingh183 opened this issue 10 months ago • 3 comments

When I run the following code on my M1 MacBook Pro (macOS 14.4.1 (23E224)):

from distilabel.pipeline import Pipeline
from distilabel.llms.llamacpp import LlamaCppLLM
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from datasets import load_dataset

dataset = load_dataset(
    "/ml/datasets/10k_prompts_ranked",
    split="train",
).filter(
    lambda r: r["avg_rating"] >= 4 and r["num_responses"] >= 2
)
dataset_list = dataset.to_list()

with Pipeline(name="text-generation-pipeline") as pipeline:
    # renamed from `load_dataset` so it no longer shadows the imported function
    load_data = LoadDataFromDicts(
        name="load_dataset",
        data=dataset_list[0:10],
        output_mappings={"prompt": "instruction"},
    )
    text_generation = TextGeneration(
        name="text_generation",
        llm=LlamaCppLLM(
            model_path="/ml/models/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf",
            n_gpu_layers=-1,  # offload all layers to the Metal GPU
        ),
    )
    load_data.connect(text_generation)

if __name__ == "__main__":
    pipeline.run(parameters={"text_generation": {"llm": {"generation_kwargs": {"temperature": 0.9,"n_ctx":2048}}}})
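
As an aside, unrelated to the Metal failure below: in llama-cpp-python, n_ctx is a model-load parameter on the Llama constructor, not a per-call generation kwarg, so if distilabel's LlamaCppLLM exposes an n_ctx field (an assumption worth checking against your installed version) it belongs on the LLM itself. A hedged sketch:

llm = LlamaCppLLM(
    model_path="/ml/models/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,  # context window is fixed at model creation time, not per generation
)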

I get the following error:

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /ml/models/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct-imatrix
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct-imatrix
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.30 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  7605.34 MiB, ( 7605.41 / 10922.67)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   532.31 MiB
llm_load_tensors:      Metal buffer size =  7605.33 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)" UserInfo={NSLocalizedDescription=Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)}
llama_new_context_with_model: failed to initialize Metal backend

[04/27/24 21:34:51] ERROR    ['distilabel.pipeline.local'] ❌ Failed to load step 'genWithMetLlama3':  local.py:489
                             Failed to create llama_context                                                        

[04/27/24 21:34:52] ERROR    ['distilabel.pipeline.local'] ❌ Failed to load all the steps             local.py:389



I had installed llama-cpp-python with:

CMAKE_ARGS="-DLLAMA_METAL_EMBED_LIBRARY=ON -DLLAMA_METAL=on" pip3 install -U --force-reinstall llama-cpp-python --no-cache-dir
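
A quick, hedged sanity check of what that build produced (the exact system-info output varies across llama-cpp-python versions) is to print the installed version and the compiled feature flags:

python3 -c "import llama_cpp; print(llama_cpp.__version__); print(llama_cpp.llama_print_system_info())"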

but the following works:

from distilabel.llms.llamacpp import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks.text_generation import TextGeneration

model_path = "/ml/models/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf"

llm = LlamaCppLLM(model_path=model_path, n_gpu_layers=-1, verbose=True)

text_generation = TextGeneration(
    name="text-generation",
    llm=llm,
    input_batch_size=8,
    pipeline=Pipeline(name="sample-text-pipeline"),
)
text_generation.load()
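
If load() succeeds, a one-row smoke test confirms generation end to end; a hedged sketch, since Task.process signatures have shifted between distilabel releases:

result = next(text_generation.process([{"instruction": "Say hello."}]))
print(result[0]["generation"])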

— amritsingh183 · Apr 27, 2024

Upgrading Python from 3.10 to 3.11 also did not help.

— amritsingh183 · Apr 27, 2024

The code works if I delete my conda environment and start fresh, but only when strictly following this order of pip installs (full sequence sketched below):

  1. install llama-cpp-python
  2. install distilabel

If I reverse the order, the error creeps back in.
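
A minimal sketch of that recipe (the env name and Python version are illustrative; the CMAKE_ARGS mirror the install command above):

conda create -n distilabel-metal python=3.11 -y
conda activate distilabel-metal
CMAKE_ARGS="-DLLAMA_METAL_EMBED_LIBRARY=ON -DLLAMA_METAL=on" pip install llama-cpp-python --no-cache-dir
pip install distilabel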

— amritsingh183 · Apr 28, 2024

Hi here @amritsingh183, thanks for reporting this issue! We'll investigate, as we did see similar problems when installing flash-attn before vllm, for example, but hadn't seen any with llama-cpp-python. We'll try to reproduce, though we don't use conda ourselves, so we're not sure whether this is dependency-ordering related or just a conda issue. If it's the former, we'll try to fix it; otherwise I'm afraid we won't be able to, but in that case a small section in the docs for conda users may do the work.

— alvarobartt · Apr 28, 2024

Hi @amritsingh183, do you have any update on this? Internally we tried to reproduce: I was unable to, while both @gabrielmbmb and @plaguss could, and they hit similar issues with the Metal backend. Since I'm able to install it and we're using the same M1/M2 chips, I'm afraid this only has to do with the llama-cpp-python installation. Could you please confirm that you can use llama-cpp without distilabel, i.e. by running the example at https://github.com/abetlen/llama-cpp-python/blob/main/examples/high_level_api/high_level_api_inference.py with n_gpu_layers set to -1 so that the Metal backend is used?
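
For reference, a minimal version of that check, with the model path taken from the report above and an illustrative prompt:

from llama_cpp import Llama

llm = Llama(
    model_path="/ml/models/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf",
    n_gpu_layers=-1,  # offload all layers; ggml_metal_init should succeed in the startup log
    verbose=True,
)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=48, stop=["Q:"])
print(out["choices"][0]["text"])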

Otherwise, I'd recommend double-checking the installation page and re-installing to see if that fixes the issue: https://github.com/abetlen/llama-cpp-python/tree/main?tab=readme-ov-file#installation-configuration

Thanks!

— alvarobartt · May 2, 2024

Hey @alvarobartt, removing everything (deleting the conda env) and starting fresh seems to have solved the issue. Maybe a bad version of llama-cpp-python had been installed, and the re-install resolved it. Thanks for looking into this.

— amritsingh183 · May 2, 2024