
Benching local GGUF model: layers allocated to VRAM but no GPU activity

Open · polarathene opened this issue 9 months ago · 1 comment

Describe the bug

I built mistral.rs with the `cuda` feature and tested it with mistralrs-bench against a local GGUF model. nvidia-smi showed the layers allocated to VRAM, but GPU utilization stayed at 0% after warmup.
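For context, this is roughly the shape of the repro; the exact mistralrs-bench flags are omitted here, only the build feature and monitoring command are as described above:

```bash
# build mistral.rs with CUDA support (feature name as reported in this issue)
cargo build --release --features cuda

# ...run mistralrs-bench against the local GGUF in one shell...

# watch GPU utilization from another shell while the bench runs
watch -n 1 nvidia-smi
```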

In contrast, within the same environment (the official llama-cpp Dockerfile, full-cuda variant), the equivalent llama-cpp bench tool ran the GPU at 100% utilization. I built both projects myself inside the same container environment, so something seems off?

More details here: https://github.com/EricLBuehler/mistral.rs/issues/329#issuecomment-2119078793

I can look at running the Dockerfile from this project, but besides cuDNN there shouldn't be much difference AFAIK. I haven't tried other commands or non-GGUF models, but I assume that shouldn't affect this?

Latest commit

v0.1.8: https://github.com/EricLBuehler/mistral.rs/commit/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45

Additional context

There is a modification I applied to be able to load local models without providing an HF token (I don't have an account yet and just wanted to try some projects with models): my workaround was to ignore HTTP 401 (unauthorized) responses the same way 404 responses are already ignored, as sketched below.
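To illustrate the idea only — this is not the actual mistral.rs code, and `hub_get`, `fetch_optional`, and `HubError` are hypothetical names — a minimal Rust sketch of treating 401 like the already-tolerated 404 when fetching optional hub files:

```rust
// Hypothetical sketch of the workaround: extend the existing "ignore 404"
// behaviour to 401, so purely local GGUF loading works without an HF token.
#[allow(dead_code)]
#[derive(Debug)]
enum HubError {
    Status(u16),   // non-success HTTP status from the hub
    Other(String), // transport or IO failure
}

// Stand-in for the real hub request; always fails here for illustration,
// as if the hub rejected an anonymous (tokenless) request.
fn hub_get(_path: &str) -> Result<Vec<u8>, HubError> {
    Err(HubError::Status(401))
}

// Returns Ok(None) for "file not available", mirroring the existing
// 404 handling and extending it to 401 (unauthorized).
fn fetch_optional(path: &str) -> Result<Option<Vec<u8>>, HubError> {
    match hub_get(path) {
        Ok(bytes) => Ok(Some(bytes)),
        Err(HubError::Status(404)) => Ok(None), // already ignored upstream
        Err(HubError::Status(401)) => Ok(None), // the workaround: no token
        Err(e) => Err(e),
    }
}

fn main() {
    // With the workaround, a 401 no longer aborts the load.
    assert!(matches!(fetch_optional("tokenizer.json"), Ok(None)));
}
```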

AFAIK this workaround shouldn't negatively affect using the GGUF model? I also had to provide additional files that llama-cpp doesn't require; from what I understand, all the relevant metadata is already contained in the GGUF file itself?

polarathene · May 19 '24 03:05