
I'm getting outrageously slow inference with both text-based and visual models

Open ljt019 opened this issue 7 months ago • 7 comments

Describe the bug

Inference takes multiple minutes even with tiny models, e.g. a 256M-parameter vision model (SmolVLM). It's not just the time spent loading the model into RAM, because if I string prompts together the wait is similar each time.

Latest commit or version

Which commit or version you ran with.

Latest

ljt019 avatar Apr 29 '25 19:04 ljt019

I used the given example code to run the model, and I also tried running in release mode because I know that's important for Candle ^
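
By "release" I mean cargo's release profile rather than a debug build (debug builds of Candle-based code are dramatically slower); something along the lines of:

cargo run --release ...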

Running this model with transformers is insanely fast by comparison.

ljt019 avatar Apr 29 '25 19:04 ljt019

^ Same issue when I tried gemma-3-12b-it as a text model

ljt019 avatar Apr 29 '25 19:04 ljt019

Hi @ljt019! That is super strange. Does your computer have a GPU, and if so, are you compiling for it?

Taking multiple minutes for a model is definitely odd though. I'll take a look.

EricLBuehler avatar Apr 29 '25 21:04 EricLBuehler

I have a GPU, but for my initial test I just used the CPU. I of course expected it to take a while, but I actually never got a response from the model after 10 minutes of waiting (no crash, just waiting for the model to respond), so I gave up for now.

I may have fewer issues compiling for GPU, but I haven't given it a test yet.

ljt019 avatar Apr 29 '25 21:04 ljt019

Oh yeah, I should point out that I eventually did get a response from the text-only model. It was still slower than I'd honestly expect from CPU inference compared to the same-sized model on llama.cpp, but I don't know if that's a Candle limitation.

ljt019 avatar Apr 29 '25 21:04 ljt019

I am experiencing something similar. I am attempting to run Qwen/Qwen3-8B with the Rust example, directly on the CPU. I am performing no runtime quantization, though the result was the same when I did:

    let model = TextModelBuilder::new("Qwen/Qwen3-8B")
        .with_logging()
        .build()
        .await?;
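
The runtime-quantized variant I also tried was along these lines (IsqType::Q8_0 is just the setting I happened to pick; it is re-exported from the mistralrs crate):

    let model = TextModelBuilder::new("Qwen/Qwen3-8B")
        .with_isq(IsqType::Q8_0) // in-situ quantization applied at load time
        .with_logging()
        .build()
        .await?;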

After being very confused about why nothing seemed to be happening, I added simple logging. After the first send_chat_request(), the program waits ~4-5 minutes for the first response. CPU usage maxes out at ~4 cores, and it consumes 16 GB of RAM on an M1 Max with 64 GB.
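
For reference, the request path I'm timing is essentially the repo's Rust example; the prompt below is just a placeholder:

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Say hello."); // placeholder prompt
    // This is the call that sits for ~4-5 minutes before the first response:
    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());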

Building with Metal support is faster, but not by that much. With the Metal config I would expect to see 30+ t/s.
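
(For reference, the Metal build is just the crate built with its metal feature enabled, along the lines of:)

cargo build --release --features metal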

EDIT: With Qwen3-0.6B (Q8) in LM Studio, I achieve >70 t/s on CPU. With Qwen3-8B (Q8) in LM Studio, I get 10.5 t/s on CPU.

With:

mistralrs-server -i plain -m unsloth/Qwen3-0.6B-FP8

I get

Prompt: 14 tokens, 28.51 T/s
Decode: 109 tokens, 2.52 T/s

agg23 avatar May 01 '25 23:05 agg23

I tried the Rust example with Qwen3-0.6B, but it was too slow. In release mode (CPU only), the example ran for 400 seconds. That's strange because the output was approximately 400 tokens, i.e. roughly 1 tok/s. By comparison, llama.cpp with no AVX managed 9 tok/s.

How can I provide a more helpful report, or fix it myself?

km19809 avatar May 09 '25 03:05 km19809

Hi @ljt019! I can't reproduce this. Can you please make sure it works after:

  • cargo clean
  • git pull

cargo run --features metal -- -i run -m ...

EricLBuehler avatar Jul 29 '25 00:07 EricLBuehler

I can give it a shot. I'll let you know how it goes.

ljt019 avatar Aug 01 '25 03:08 ljt019