
Inference is very slow on Mac m1

Open hhamud opened this issue 11 months ago • 8 comments

Describe the bug

I am using a fine-tuned model based on Microsoft's Phi-3.5 mini called `Sciphi/triplex`, which aims to extract entities and relationships from a piece of text. However, inference is very slow: it takes approximately 30 s when it should take about 3 s.

Any ideas on why this would be?

Code is here:

pub fn format(text: &str) -> String {
    // Joining string slices directly avoids the itertools dependency that the
    // original iterator-based `join` would require.
    let entity = ["LOCATION", "POSITION", "DATE", "CITY", "COUNTRY", "NUMBER"].join(",");
    let predicates = ["POPULATION", "AREA"].join(",");

    // Left-aligned so the indentation doesn't leak into the prompt text.
    format!(
        "
Perform Named Entity Recognition (NER) and extract knowledge graph triplets from the text. NER identifies named entities of given entity types, and triple extraction identifies relationships between entities using specified predicates.

**Entity Types:**
{entity}

**Predicates:**
{predicates}

**Text:**
{text}
"
    )
}

impl Llm {
    pub async fn new(model_id: &str) -> Self {
        let model = TextModelBuilder::new(model_id)
            .with_logging()
            .with_paged_attn(|| {
                PagedAttentionMetaBuilder::default()
                    // preallocate only 500 MB for the PagedAttention KV cache
                    .with_gpu_memory(MemoryGpuConfig::MbAmount(500))
                    .build()
            })
            .unwrap()
            .build()
            .await
            .unwrap();

        Self { model }
    }

    pub async fn send_message(self, message: &str) -> Result<String, String> {
        let messages = TextMessages::new().add_message(TextMessageRole::User, message);

        let resp = self
            .model
            .send_chat_request(messages)
            .await
            .map_err(|e| e.to_string())?;
        let res = resp.choices[0]
            .message
            .content
            .clone()
            .ok_or_else(|| "failed to parse response".to_string())?;

        // Pretty-print the response if it is valid JSON; otherwise return it as-is.
        if let Ok(parsed) = serde_json::from_str::<Value>(&res) {
            if let Ok(formatted) = serde_json::to_string_pretty(&parsed) {
                return Ok(formatted);
            }
        }

        Ok(res)
    }
}

#[cfg(test)]
mod tests {
    use std::time::Instant;

    use super::*;

    #[tokio::test(flavor = "multi_thread", worker_threads = 1)]
    async fn test_spawn_llm() {
        let text = "San Francisco, officially the City and County of San Francisco, is a commercial, financial, and cultural center in Northern California. With a population of 808,437 residents as of 2022, San Francisco is the fourth most populous city in the U.S. state of California behind Los Angeles, San Diego, and San Jose.";
        let input = format(&text);
        let llm = Llm::new("sciphi/triplex").await;
        let start = Instant::now();
        let text = llm.send_message(&input).await;
        let end = start.elapsed();
        println!("Total time: {:?}", end);
        assert!(text.is_ok());
    }
}

Latest commit or version

master branch, latest commit

hhamud avatar Jan 16 '25 02:01 hhamud

Is this compiled with the metal feature enabled?

cdoko avatar Jan 16 '25 10:01 cdoko

Is this compiled with the metal feature enabled?

Yes, I have this in my cargo toml

mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", branch = "master", features = ["metal"] }

hhamud avatar Jan 16 '25 14:01 hhamud

Seconded. It seemed to take a nosedive about a month ago.

hiive avatar Feb 12 '25 06:02 hiive

@hiive I think I may have a solution for your case.

On Metal, our preallocation for a large PagedAttention KV cache can cause slowdowns for some reason.

I would recommend using the PagedAttentionMetaBuilder::with_gpu_memory method to cap the allocation at a reasonable size in MB (for example, 4096 MB). I think this should improve speeds.
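A minimal sketch of that suggestion, applied to the builder chain from the original post (the 4096 MB figure is just the example value above, not a tuned number):

```rust
// Sketch only: same builder calls as in the issue, but with the
// PagedAttention KV-cache preallocation capped at 4096 MB.
let model = TextModelBuilder::new(model_id)
    .with_logging()
    .with_paged_attn(|| {
        PagedAttentionMetaBuilder::default()
            .with_gpu_memory(MemoryGpuConfig::MbAmount(4096))
            .build()
    })
    .unwrap()
    .build()
    .await
    .unwrap();
```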

@hhamud how much memory is available on your system?

EricLBuehler avatar Feb 12 '25 21:02 EricLBuehler

@hiive I think I may have a solution for your case.

On Metal, our preallocation for a large PagedAttention KV cache can cause slowdowns for some reason.

I would recommend using the PagedAttentionMetaBuilder::with_gpu_memory method to cap the allocation at a reasonable size in MB (for example, 4096 MB). I think this should improve speeds.

@hhamud how much memory is available on your system?

Mac M1 Pro, 32 GB.

hhamud avatar Feb 21 '25 15:02 hhamud

@EricLBuehler - I'll give that a try. For reference, the machine I'm running it on is a MacBook Pro: M2 Max, 96 GB RAM.

hiive avatar Mar 21 '25 14:03 hiive

@EricLBuehler - I'll give that a try. For reference, the machine I'm running it on is a MacBook Pro: M2 Max, 96 GB RAM.

Do you have any idea on what the cause is? Even a general rough idea would be good enough here.

hhamud avatar Mar 22 '25 02:03 hhamud

Do you have any idea on what the cause is? Even a general rough idea would be good enough here.

Yeah, if you allocate more than the "recommended max working set size" (in my experience, about 75% of the machine's total RAM), things get really slow. I'm not sure exactly why that is, though.
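That rule of thumb can be written down as a quick sanity check. Note the helper name and the 75% figure are only illustrations of the observation above, not an official Metal API:

```rust
/// Hypothetical helper: returns true if a requested allocation (in MB) stays
/// under the observed Metal ceiling of roughly 75% of total unified memory
/// (the "recommended max working set size").
fn fits_working_set(requested_mb: u64, total_ram_mb: u64) -> bool {
    requested_mb <= total_ram_mb * 3 / 4
}

fn main() {
    let total_mb = 32 * 1024; // the reporter's 32 GB M1 Pro
    // The suggested 4096 MB cap sits well under the ~24576 MB ceiling...
    assert!(fits_working_set(4096, total_mb));
    // ...while preallocating close to all of RAM does not.
    assert!(!fits_working_set(30_000, total_mb));
    println!("ceiling: {} MB", total_mb * 3 / 4);
}
```

By the same check, the M2 Max / 96 GB machine mentioned above would have a ceiling of roughly 73728 MB.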

EricLBuehler avatar Mar 22 '25 02:03 EricLBuehler