mistral.rs
Slow CUDA inference speed
This discussion reports mistral.rs as being faster than llama.cpp: https://github.com/EricLBuehler/mistral.rs/discussions/612
However, I'm seeing much slower speeds for the same prompt and settings.
Mistral.rs
Usage { completion_tokens: 501, prompt_tokens: 28, total_tokens: 529, avg_tok_per_sec: 16.980707, avg_prompt_tok_per_sec: 76.08695, avg_compl_tok_per_sec: 16.27416, total_time_sec: 31.153, total_prompt_time_sec: 0.368, total_completion_time_sec: 30.785 }
llama.cpp
timings: {"predicted_ms": 4007.64, "prompt_per_token_ms": 0.7041786, "predicted_per_token_ms": 8.01528, "prompt_ms": 19.717, "prompt_per_second": 1420.0944, "predicted_n": 500.0, "prompt_n": 28.0, "predicted_per_second": 124.7617}
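For reference, recomputing decode throughput from the raw counts in both logs (token counts divided by completion time) confirms the reported per-second figures and the size of the gap:

```rust
fn main() {
    // mistral.rs: completion_tokens / total_completion_time_sec
    let mistral_rs_tps = 501.0_f64 / 30.785;
    // llama.cpp: predicted_n / predicted_ms (converted to seconds)
    let llama_cpp_tps = 500.0_f64 / (4007.64 / 1000.0);

    println!("mistral.rs decode: {mistral_rs_tps:.2} tok/s"); // ~16.27
    println!("llama.cpp decode:  {llama_cpp_tps:.2} tok/s"); // ~124.76
    println!("gap: {:.1}x", llama_cpp_tps / mistral_rs_tps); // ~7.7x
}
```

So on the same prompt, llama.cpp is decoding roughly 7.7x faster here.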
The code I'm using to init mistral.rs: https://github.com/ShelbyJenkins/llm_client/blob/b1edca89bbdc34b884907fd39be6eedabf10d81b/src/llm_backends/mistral_rs/builder.rs#L110
I'm using the basic completion tests here: https://github.com/ShelbyJenkins/llm_client/blob/b1edca89bbdc34b884907fd39be6eedabf10d81b/src/basic_completion.rs#L158
Testing on Ubuntu, running an Ubuntu Docker container (FROM nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04). I've tried loading the layers onto a single GPU using the dummy device map, and loading onto both GPUs using the device mapper. The GPUs are 3090s, and testing is done with Phi 3 mini.
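One way to rule out a device-mapping problem is to poll `nvidia-smi` while the completion test is generating and check that utilization and memory actually land on the expected GPU(s). A minimal sketch (the `parse_gpu_line` helper is my own, not part of mistral.rs):

```rust
use std::process::Command;

/// Parse one line of `nvidia-smi --query-gpu=index,utilization.gpu,memory.used
/// --format=csv,noheader,nounits` output, e.g. "0, 87, 14836".
fn parse_gpu_line(line: &str) -> Option<(u32, u32, u32)> {
    let mut parts = line.split(',').map(str::trim);
    Some((
        parts.next()?.parse().ok()?, // GPU index
        parts.next()?.parse().ok()?, // utilization %
        parts.next()?.parse().ok()?, // memory used (MiB)
    ))
}

fn main() {
    // Poll once; run this in a loop while tokens are being generated.
    let result = Command::new("nvidia-smi")
        .args([
            "--query-gpu=index,utilization.gpu,memory.used",
            "--format=csv,noheader,nounits",
        ])
        .output();
    match result {
        Ok(out) => {
            for line in String::from_utf8_lossy(&out.stdout).lines() {
                if let Some((idx, util, mem)) = parse_gpu_line(line) {
                    println!("GPU {idx}: {util}% util, {mem} MiB used");
                }
            }
        }
        Err(_) => eprintln!("nvidia-smi not available on this machine"),
    }
}
```

If utilization stays near zero during decode, the model (or part of it) is likely running on CPU rather than the 3090s, which would explain a gap of this size.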