I'm getting outrageously slow inference with both text-based and vision models
Describe the bug
Inference takes multiple minutes even with tiny models, e.g. a 256M-parameter vision model (SmolVLM). It's not just the time spent loading the model into RAM, because if I chain prompts together there's a similar wait on each one.
Latest commit or version
Latest
I used the provided example code to run the model, and I also tried running in release mode, since I know that's important for Candle.
Running this model with Transformers is dramatically faster by comparison.
Same issue when I tried gemma-3-12b-it as a text model; a rough sketch of what I ran is below.
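For context, my text-model run was essentially the stock example, built and run with `cargo run --release`. A minimal sketch of it, assuming the `TextMessages`/`TextMessageRole` helpers and the `choices[0].message.content` response shape from the current examples (these names may differ slightly in your version):

```rust
use anyhow::Result;
use mistralrs::{TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Build the text model on the default device; no runtime quantization requested.
    let model = TextModelBuilder::new("gemma-3-12b-it")
        .with_logging()
        .build()
        .await?;

    // Single-turn chat request, same shape as the repository example.
    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Explain what the borrow checker does.");

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}
```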
Hi @ljt019! That is super strange. Does your computer have a GPU, and if so, are you compiling for it?
Taking multiple minutes for a model is definitely odd though. I'll take a look.
I have a GPU, but for my initial test I just used the CPU. I of course expected it to take a while, but I actually never got a response from the model after 10 minutes of waiting (no crash, just waiting for the model to respond), so I gave up for now.
I may have fewer issues compiling for GPU, but I haven't tested that yet.
Oh, I should also point out that I did eventually get a response from the text-only model. It was still slower than I'd honestly expect from CPU inference compared to a similarly sized model on llama.cpp, but I don't know whether that's a Candle limitation.
I am experiencing something similar. I am attempting to run Qwen/Qwen3-8B with the Rust example directly on the CPU. I am performing no runtime quantization, though the result was the same when I did:
```rust
let model = TextModelBuilder::new("Qwen/Qwen3-8B")
    .with_logging()
    .build()
    .await?;
```
After being very confused about why nothing seemed to be happening, I added simple logging. After the first send_chat_request(), the program waits ~4-5 minutes for the first response. CPU usage maxes out around 4 cores, and it consumes 16GB of RAM on an M1 Max with 64GB.
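The "simple logging" here was just a wall-clock timer around the first request, roughly like this sketch (standard-library `Instant`; the message helpers and response shape are assumed as in the snippet above):

```rust
use std::time::Instant;

use anyhow::Result;
use mistralrs::{TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Same builder as above: CPU, no runtime quantization.
    let model = TextModelBuilder::new("Qwen/Qwen3-8B")
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new().add_message(TextMessageRole::User, "Say hello.");

    // Time the first chat request end to end; on CPU this is where the ~4-5 minute wait shows up.
    let start = Instant::now();
    let response = model.send_chat_request(messages).await?;
    println!("first response after {:?}", start.elapsed());
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}
```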
Building with Metal support is faster, but still not fast; with the Metal config I would expect to see 30+ t/s.
EDIT: With Qwen3-0.6B (Q8) in LM Studio, I achieve >70 t/s on CPU. With Qwen3-8B (Q8) in LM Studio, I get 10.5 t/s on CPU.
With:

```
mistralrs-server -i plain -m unsloth/Qwen3-0.6B-FP8
```

I get:

```
Prompt: 14 tokens, 28.51 T/s
Decode: 109 tokens, 2.52 T/s
```
I tried the Rust example with Qwen3-0.6B, but it was too slow: in release mode (CPU only), the example ran for 400 seconds, and the output was approximately 400 tokens, i.e. roughly 1 tok/s. By comparison, llama.cpp with no AVX managed 9 tok/s.
How can I provide a more helpful report, or fix it by myself?
Hi @ljt019 I can't reproduce this, can you please make sure it works after:

```
cargo clean
git pull
cargo run --features metal -- -i run -m ...
```
I can give it a shot, I'll let you know how it goes.