rust-llama.cpp Using metal and `n_gpu_layers` produces no tokens

I'm running the example script with a few different models:

use llama_cpp_rs::{
    options::{ModelOptions, PredictOptions},
    LLama,
};

pub fn llama_predict() -> Result<String, anyhow::Error> {
    
    // Metal seems to give really bad results.
    let model_options = ModelOptions {
        // n_gpu_layers: 1,
        ..Default::default()
    };
    
    // let model_options = ModelOptions::default();

    let llama = LLama::new(
        "models/mistral-7b-instruct-v0.1.Q4_0.gguf".into(),
        &model_options,
    )
    .unwrap();

    let predict_options = PredictOptions {
        // top_k: 20,
        // top_p: 0.1,
        // f16_kv: true,

        token_callback: Some(Box::new(|token| {
            println!("token: {}", token);
            true
        })),
        ..Default::default()
    };

    // TODO: get this working on master. Metal support is flaky.
    let response = llama
        .predict(
            "what are the national animals of india".into(),
            predict_options,
        )
        .unwrap();
    println!("Response: {}", response);
    Ok(response)
}


#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_llama_cpp_rs() -> Result<(), anyhow::Error> {
        let response = llama_predict()?;
        println!("Response: {}", response);
        assert!(!response.is_empty());
        Ok(())
    }
}

When Metal is not used (n_gpu_layers is not set), the models generate tokens, e.g.:

token: ind
token: ian
token:  national
token:  animal
token:  is
token:  t
token: iger
token: 
Response: indian national animal is tiger
Response: indian national animal is tiger

When I set n_gpu_layers, no tokens are generated, e.g.:

llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =   64.00 MiB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 76.07 MiB
llama_new_context_with_model: max tensor size =   102.54 MiB
count 0
token:
token:
token:
token:
...
Response:
Response:
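
To catch this failure early instead of waiting for a full run of empty tokens, a token callback could stop generation after several consecutive empty tokens. The following is a minimal sketch, not part of the crate: `empty_token_guard` and its `max_empty` threshold are hypothetical names, and the callback type `Fn(String) -> bool` is an assumption inferred from the closure in the example above (returning `false` is assumed to stop prediction).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical helper: builds a token callback that returns `false`
/// after `max_empty` consecutive empty tokens, plus a shared counter of
/// the non-empty tokens seen so far.
fn empty_token_guard(
    max_empty: usize,
) -> (Box<dyn Fn(String) -> bool + Send>, Arc<AtomicUsize>) {
    let non_empty = Arc::new(AtomicUsize::new(0));
    let counter = Arc::clone(&non_empty);
    // Length of the current run of empty tokens.
    let empty_run = AtomicUsize::new(0);

    let callback: Box<dyn Fn(String) -> bool + Send> = Box::new(move |token: String| {
        if token.is_empty() {
            // fetch_add returns the previous value, so add 1 for the current run length.
            empty_run.fetch_add(1, Ordering::Relaxed) + 1 < max_empty
        } else {
            // A real token arrived: reset the run and count it.
            empty_run.store(0, Ordering::Relaxed);
            counter.fetch_add(1, Ordering::Relaxed);
            true
        }
    });

    (callback, non_empty)
}

fn main() {
    // Simulate the failing run: every token is empty.
    let (callback, non_empty) = empty_token_guard(3);
    for token in ["", "", "", ""] {
        if !callback(token.to_string()) {
            eprintln!("stopping: too many consecutive empty tokens");
            break;
        }
    }
    println!("non-empty tokens seen: {}", non_empty.load(Ordering::Relaxed));
}
```

Under the assumption above, the boxed closure could be passed as `token_callback: Some(callback)` in `PredictOptions`. A threshold greater than one is used because a single empty token also appears at the end of the working run.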

Is this a known behavior?

Jan 04 '24 20:01 jasonw247

Is llama.cpp actually using Metal? I tried this and noticed (only after enabling some debug logging) that in fact the file ggml-metal.metal could not be found (it needs to be placed in the current working directory). After this the basic example works just fine for me (and actually uses the GPU) with a Mixtral GGUF model.
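
A quick way to rule this out before loading the model is to check the working directory for the shader file. This is a minimal sketch using only the standard library; the helper names `metal_shader_path` and `metal_shader_present` are hypothetical, and the only fact taken from the above is that `ggml-metal.metal` is looked up in the current working directory.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Hypothetical helper: the path where the Metal shader source is expected.
fn metal_shader_path(dir: &Path) -> PathBuf {
    dir.join("ggml-metal.metal")
}

/// Hypothetical helper: true if `ggml-metal.metal` exists as a file in `dir`.
fn metal_shader_present(dir: &Path) -> bool {
    metal_shader_path(dir).is_file()
}

fn main() -> std::io::Result<()> {
    let cwd = env::current_dir()?;
    if metal_shader_present(&cwd) {
        println!("found {}", metal_shader_path(&cwd).display());
    } else {
        eprintln!(
            "ggml-metal.metal not found in {}; Metal may not be in use",
            cwd.display()
        );
    }
    Ok(())
}
```

Running this from the same directory as the example shows whether the file is where llama.cpp would look for it.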

Jan 14 '24 17:01 pixelspark

I copied over the necessary Metal files; otherwise I would get an error. After copying the files, I ran into the issue where no tokens are generated.

Jan 16 '24 16:01 jasonw247

> Is llama.cpp actually using Metal? I tried this and noticed (only after enabling some debug logging) that in fact the file ggml-metal.metal could not be found (it needs to be placed in the current working directory). After this the basic example works just fine for me (and actually uses the GPU) with a Mixtral GGUF model.

AFAIK it does: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#metal-build

Feb 11 '24 18:02 shaqq

llama-cpp-python requires the user to specify CMAKE_ARGS during pip install: https://llama-cpp-python.readthedocs.io/en/latest/install/macos/

Do users need to do something similar during cargo install for this crate?

Feb 11 '24 18:02 shaqq

Reading through this PR, it seems llama.cpp needs to be built with specific flags for Metal support to work: https://github.com/ggerganov/llama.cpp/pull/1642

Feb 17 '24 21:02 mikecvet
