rust-llama.cpp
Using metal and `n_gpu_layers` produces no tokens
I'm running the example script with a few different models:
use llama_cpp_rs::{
    options::{ModelOptions, PredictOptions},
    LLama,
};

pub fn llama_predict() -> Result<String, anyhow::Error> {
    // metal seems to give really bad results
    let model_options = ModelOptions {
        //n_gpu_layers: 1,
        ..Default::default()
    };
    // let model_options = ModelOptions::default();
    let llama = LLama::new(
        "models/mistral-7b-instruct-v0.1.Q4_0.gguf".into(),
        &model_options,
    )
    .unwrap();
    let predict_options = PredictOptions {
        //top_k: 20,
        // top_p: 0.1,
        // f16_kv: true,
        token_callback: Some(Box::new(|token| {
            println!("token: {}", token);
            true
        })),
        ..Default::default()
    };
    // TODO: get this working on master. Metal support is flakey.
    let response = llama
        .predict(
            "what are the national animals of india".into(),
            predict_options,
        )
        .unwrap();
    println!("Response: {}", response);
    Ok(response)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_llama_cpp_rs() -> Result<(), anyhow::Error> {
        let response = llama_predict()?;
        println!("Response: {}", response);
        assert!(!response.is_empty());
        Ok(())
    }
}
When not using Metal (that is, without setting `n_gpu_layers`), the models generate tokens, e.g.:
token: ind
token: ian
token: national
token: animal
token: is
token: t
token: iger
token:
Response: indian national animal is tiger
Response: indian national animal is tiger
When I set `n_gpu_layers`, no tokens are generated, e.g.:
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 64.00 MiB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 76.07 MiB
llama_new_context_with_model: max tensor size = 102.54 MiB
count 0
token:
token:
token:
token:
...
Response:
Response:
Is this a known behavior?
Is llama.cpp actually using Metal? I tried this and noticed (only after enabling some debug logging) that in fact the file `ggml-metal.metal` could not be found (it needs to be placed in the current working directory). After this the basic example works just fine for me (and actually uses the GPU) with a Mixtral GGUF model.
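One way to rule this out before loading a model is to check for the shader file up front. The following is only a minimal sketch: it assumes, as described in the comment above, that `ggml-metal.metal` has to sit in the current working directory, and the helper name `find_metal_shader` is hypothetical (it is not part of `llama_cpp_rs`).

```rust
use std::env;
use std::path::{Path, PathBuf};

/// File name of the Metal shader source mentioned in this thread.
const METAL_SHADER_FILE: &str = "ggml-metal.metal";

/// Hypothetical helper: returns the full path to the shader file
/// if it exists as a regular file directly inside `dir`.
fn find_metal_shader(dir: &Path) -> Option<PathBuf> {
    let candidate = dir.join(METAL_SHADER_FILE);
    if candidate.is_file() {
        Some(candidate)
    } else {
        None
    }
}

fn main() {
    // Assumption: the shader is looked up relative to the working directory.
    let cwd = env::current_dir().expect("could not determine the current working directory");
    match find_metal_shader(&cwd) {
        Some(path) => println!("found {}", path.display()),
        None => eprintln!(
            "{} is missing from {}; Metal may not initialize correctly",
            METAL_SHADER_FILE,
            cwd.display()
        ),
    }
}
```

Running this from the same directory as the example makes a missing file visible without having to enable debug logging first.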
I copied over the necessary Metal files; otherwise I would get an error. After copying the files, I ran into the issue where no tokens are generated.
> Is llama.cpp actually using Metal? I tried this and noticed (only after enabling some debug logging) that in fact the file `ggml-metal.metal` could not be found (it needs to be placed in the current working directory). After this the basic example works just fine for me (and actually uses the GPU) with a Mixtral GGUF model.
AFAIK it does: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#metal-build
llama-cpp-python requires the user to specify `CMAKE_ARGS` during `pip install`: https://llama-cpp-python.readthedocs.io/en/latest/install/macos/
Do users need to do something similar during `cargo install` for this crate?
Reading through here, it seems like llama.cpp needs to be built with specific flags in order for metal support to work: https://github.com/ggerganov/llama.cpp/pull/1642