Inference is very slow on Mac m1
Describe the bug
I am using a fine-tuned model based on Microsoft's Phi-3.5-mini, called 'SciPhi/Triplex', which extracts entities and relationships from a piece of text. However, inference is very slow: it takes approximately 30s when it should take around 3s.
Any ideas on why this would be?
Code is here:
pub fn format(text: &str) -> String {
    // Comma-separated lists of the entity types and predicates to extract.
    let entity = ["LOCATION", "POSITION", "DATE", "CITY", "COUNTRY", "NUMBER"].join(",");
    let predicates = ["POPULATION", "AREA"].join(",");
    format!(
        "
Perform Named Entity Recognition (NER) and extract knowledge graph triplets from the text. NER identifies named entities of given entity types, and triple extraction identifies relationships between entities using specified predicates.
**Entity Types:**
{entity}
**Predicates:**
{predicates}
**Text:**
{text}
"
    )
}
impl Llm {
    pub async fn new(model_id: &str) -> Self {
        let model = TextModelBuilder::new(model_id)
            .with_logging()
            .with_paged_attn(|| {
                PagedAttentionMetaBuilder::default()
                    .with_gpu_memory(MemoryGpuConfig::MbAmount(500))
                    .build()
            })
            .unwrap()
            .build()
            .await
            .unwrap();
        Self { model }
    }

    pub async fn send_message(&self, message: &str) -> Result<String, String> {
        let messages = TextMessages::new().add_message(TextMessageRole::User, message);
        let resp = self
            .model
            .send_chat_request(messages)
            .await
            .map_err(|e| e.to_string())?;
        let content = resp.choices[0]
            .message
            .content
            .clone()
            .ok_or_else(|| "failed to parse response".to_string())?;
        // Pretty-print the response if it is valid JSON; otherwise return it as-is.
        if let Ok(parsed) = serde_json::from_str::<Value>(&content) {
            if let Ok(formatted) = serde_json::to_string_pretty(&parsed) {
                return Ok(formatted);
            }
        }
        Ok(content)
    }
}
#[cfg(test)]
mod tests {
    use std::time::Instant;

    use super::*;

    #[tokio::test(flavor = "multi_thread", worker_threads = 1)]
    async fn test_spawn_llm() {
        let text = "San Francisco, officially the City and County of San Francisco, is a commercial, financial, and cultural center in Northern California. With a population of 808,437 residents as of 2022, San Francisco is the fourth most populous city in the U.S. state of California behind Los Angeles, San Diego, and San Jose.";
        let input = format(text);
        let llm = Llm::new("sciphi/triplex").await;
        let start = Instant::now();
        let text = llm.send_message(&input).await;
        println!("Total time: {:?}", start.elapsed());
        assert!(text.is_ok());
    }
}
Latest commit or version
master branch, latest commit
Is this compiled with the metal feature enabled?
Yes, I have this in my Cargo.toml:
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", branch = "master", features = ["metal"] }
Seconded. It seemed to take a nosedive about a month ago.
@hiive I think I may have a solution for your case.
On Metal, our preallocation for a large PagedAttention KV cache can cause slowdowns for some reason.
I would recommend checking out the PagedAttentionMetaBuilder::with_gpu_memory method to set the memory amount (in MB) to a reasonable amount (for example, 4096 MB). I think this should improve speeds.
@hhamud how much memory is available on your system?
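Applied to the builder from the original report, the suggestion above might look like the fragment below. The 4096 MB figure is only an example from the comment, not a documented recommendation; it should be tuned to the machine's available RAM.

```rust
// Sketch based on the builder in the original report, with the
// PagedAttention KV-cache budget raised from 500 MB to 4096 MB.
let model = TextModelBuilder::new("sciphi/triplex")
    .with_logging()
    .with_paged_attn(|| {
        PagedAttentionMetaBuilder::default()
            .with_gpu_memory(MemoryGpuConfig::MbAmount(4096))
            .build()
    })
    .unwrap()
    .build()
    .await
    .unwrap();
```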
Mac M1 Pro, 32 GB.
@EricLBuehler - I'll give that a try. For reference, the machine I'm running it on is a Macbook Pro - M2 Max, 96GB ram.
Do you have any idea on what the cause is? Even a general rough idea would be good enough here.
Yeah, if you allocate over the "recommended max working set size" (in my experience, about 75% of the machine's total RAM), things get really slow. Not sure exactly why this is, though.
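As a rough back-of-the-envelope check, the 75% rule of thumb above (anecdotal, not a documented Metal limit) could be sketched like this. The function name and the model-size figures are hypothetical, purely for illustration:

```rust
/// Hedged sketch: estimate a PagedAttention KV-cache budget (in MB) that
/// stays under the "recommended max working set size", taken here to be
/// roughly 75% of total system RAM, minus space for the model weights.
fn kv_cache_budget_mb(total_ram_mb: u64, model_mb: u64) -> u64 {
    let working_set_cap = total_ram_mb * 3 / 4; // ~75% of total RAM
    working_set_cap.saturating_sub(model_mb)
}

fn main() {
    // A 32 GB machine with ~4 GB of model weights.
    println!("{}", kv_cache_budget_mb(32 * 1024, 4 * 1024)); // prints 20480
}
```

In practice you would clamp this to something well below the cap rather than right at it, since other allocations (activations, the OS, other apps) also count toward the working set.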