Benching a local GGUF model: layers allocated to VRAM but no GPU activity
Describe the bug
Building mistral.rs with the cuda feature, when I tested it with mistralrs-bench and a local GGUF, I observed via nvidia-smi that layers were allocated to VRAM, but GPU activity was 0% after warmup. Within the same environment (the official llama-cpp Dockerfile, full-cuda variant), the equivalent llama-cpp bench tool ran with the GPU at 100%. I built both projects within the same container environment myself, so something seems off.
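For reference, this is roughly how I watched utilization during both runs: a minimal sketch that polls nvidia-smi's query interface once per second. The --query-gpu and --format flags are standard nvidia-smi options; the rest is illustrative.

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

// Poll GPU utilization and memory use once per second via nvidia-smi.
fn main() {
    loop {
        let out = Command::new("nvidia-smi")
            .args([
                "--query-gpu=utilization.gpu,memory.used",
                "--format=csv,noheader,nounits",
            ])
            .output()
            .expect("failed to run nvidia-smi");
        // One line per GPU: "<util %>, <memory MiB>".
        print!("{}", String::from_utf8_lossy(&out.stdout));
        sleep(Duration::from_secs(1));
    }
}
```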
More details here: https://github.com/EricLBuehler/mistral.rs/issues/329#issuecomment-2119078793
I can look at running the Dockerfile from this project, but besides cudnn there shouldn't be much difference AFAIK. I've not tried other commands or non-GGUF models, but I assume that shouldn't affect this?
Latest commit
v0.1.8: https://github.com/EricLBuehler/mistral.rs/commit/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45
Additional context
There is a modification I've applied to be able to load local models without providing an HF token (I don't have an account yet and just wanted to try some projects with models): my workaround was to ignore 401 (Unauthorized) responses the same way 404 responses are ignored, as sketched below.
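For context, here is a minimal sketch of what that workaround does, assuming a fetch helper that exposes the HTTP status code. HubError and fetch_or_skip are hypothetical names for illustration, not the actual mistral.rs API.

```rust
use std::path::PathBuf;

// Hypothetical stand-in for the hub client's error type; only the
// HTTP status code matters for this sketch.
#[derive(Debug)]
struct HubError {
    status: Option<u16>,
}

// Treat 401 (no HF token) the same as 404: skip the remote file and
// fall back to local copies instead of aborting the model load.
fn fetch_or_skip<F>(fetch: F, filename: &str) -> Result<Option<PathBuf>, HubError>
where
    F: Fn(&str) -> Result<PathBuf, HubError>,
{
    match fetch(filename) {
        Ok(path) => Ok(Some(path)),
        Err(HubError { status: Some(404 | 401) }) => Ok(None),
        Err(e) => Err(e),
    }
}

fn main() {
    // Simulate an unauthorized hub response (no token provided).
    let unauthorized = |_: &str| -> Result<PathBuf, HubError> {
        Err(HubError { status: Some(401) })
    };
    // With the workaround, the 401 is swallowed and we get None back.
    assert!(fetch_or_skip(unauthorized, "tokenizer.json").unwrap().is_none());
}
```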
AFAIK this shouldn't negatively affect using the GGUF model? Additional files had to be provided here that llama-cpp does not require; from what I understand, all the relevant metadata is already embedded in the GGUF file itself?
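To illustrate that last point, here is a minimal self-contained sketch (mine, not from either project) that reads the fixed GGUF header; the metadata key/value block holding things like the tokenizer and architecture config follows it immediately in the same file.

```rust
use std::fs::File;
use std::io::{self, Read};

// Read the fixed GGUF header: 4-byte magic, u32 version, then the tensor
// count and metadata key/value count (both u64, little-endian, in v2/v3).
fn read_gguf_header(path: &str) -> io::Result<()> {
    let mut f = File::open(path)?;

    let mut magic = [0u8; 4];
    f.read_exact(&mut magic)?;
    if &magic != b"GGUF" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not a GGUF file"));
    }

    let mut u32buf = [0u8; 4];
    f.read_exact(&mut u32buf)?;
    let version = u32::from_le_bytes(u32buf);

    let mut u64buf = [0u8; 8];
    f.read_exact(&mut u64buf)?;
    let tensor_count = u64::from_le_bytes(u64buf);
    f.read_exact(&mut u64buf)?;
    let metadata_kv_count = u64::from_le_bytes(u64buf);

    // The metadata_kv_count key/value entries (tokenizer, architecture,
    // hyperparameters, ...) follow this header inside the same file.
    println!("GGUF v{version}: {tensor_count} tensors, {metadata_kv_count} metadata keys");
    Ok(())
}

fn main() -> io::Result<()> {
    // Path is illustrative; point it at any local GGUF file.
    read_gguf_header("model.gguf")
}
```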