Running a model from a GGUF file, only
Describe the bug
Running a model from a GGUF file using llama.cpp is very straightforward, like this:
server -v -ngl 99 -m Phi-3-mini-4k-instruct-Q6_K.gguf
and if the model is supported, it just works.
I tried to do the same using mistral.rs, and I got this:
mistralrs-server gguf -m . -t . -f .\Phi-3-mini-4k-instruct-Q6_K.gguf
2024-05-17T07:21:48.581660Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-17T07:21:48.581743Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-17T07:21:48.581820Z INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-17T07:21:48.581873Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-17T07:21:48.583625Z INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:21:48.583707Z INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:290:58:
File "tokenizer.json" not found at model id "."
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Why does it ask me for a tokenizer file, when it's included in the GGUF file? I understand having this as an option (if I wanted to try out a different tokenizer / configuration), but by default it should just use the information provided in the GGUF file itself.
Next attempt, after copying tokenizer.json from the original model repo:
mistralrs-server gguf -m . -t . -f .\Phi-3-mini-4k-instruct-Q6_K.gguf
2024-05-17T07:28:34.987235Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-17T07:28:34.987332Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-17T07:28:34.987382Z INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-17T07:28:34.987431Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-17T07:28:34.989190Z INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:28:34.989270Z INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
2024-05-17T07:28:35.532371Z INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `".\\tokenizer.json"`
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:290:58:
File "config.json" not found at model id "."
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
And another attempt, after copying config.json (which I think is also unnecessary, as llama.cpp works fine without it):
mistralrs-server gguf -m . -t . -f .\Phi-3-mini-4k-instruct-Q6_K.gguf
2024-05-17T07:30:00.352139Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-17T07:30:00.352236Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-17T07:30:00.352301Z INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-17T07:30:00.352344Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-17T07:30:00.354085Z INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:30:00.354168Z INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
2024-05-17T07:30:00.601055Z INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `".\\tokenizer.json"`
2024-05-17T07:30:00.814258Z INFO mistralrs_core::pipeline::gguf: Loading `"config.json"` locally at `".\\config.json"`
2024-05-17T07:30:00.814412Z INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:30:00.814505Z INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
2024-05-17T07:30:01.022055Z INFO mistralrs_core::pipeline: Loading `".\\Phi-3-mini-4k-instruct-Q6_K.gguf"` locally at `".\\.\\Phi-3-mini-4k-instruct-Q6_K.gguf"`
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:290:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I wanted to give mistral.rs a shot, but it's a really painful experience for now.
Latest commit ca9bf7d1a8a67bd69a3eed89841a106d2e518c45 (v0.1.8)
Did you add the Hugging Face token? I got the same error RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main])) until I added the token.
Here are the ways you can add it: https://github.com/EricLBuehler/mistral.rs?tab=readme-ov-file#getting-models-from-hf-hub.
Also, this thread helped me as I was getting a 403 error after that: https://discuss.huggingface.co/t/error-403-what-to-do-about-it/12983. I had to accept the Llama license.
@joshpopelka20 I want to run a model from a local GGUF file only - exactly the same way as in llama.cpp. Communication with HF (or any other) servers shouldn't ever be required for that.
A recent issue also showed UX issues with this: https://github.com/EricLBuehler/mistral.rs/issues/295#issuecomment-2106450931
UPDATE: This local model support may have been a very new feature it seems, which might explain the current UX issues: https://github.com/EricLBuehler/mistral.rs/pull/308
I found the README a bit confusing too compared to llama.cpp for local GGUF (it doesn't help that it refers to terms you need to configure but then uses short option names, and the linked CLI args output also appears outdated compared to what a git build shows).
I was not able to use absolute or relative paths in a way that mistral.rs would understand / accept, so based on the linked issue above, I had to ensure the binary was adjacent to the model (and the forced tokenizer.json + config.json files)...
It still fails like yours did, but here is the extra output showing why:
$ RUST_BACKTRACE=1 ./mistralrs-server --token-source none gguf -m . -t . -f model.gguf
2024-05-18T01:41:28.388727Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-18T01:41:28.388775Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-18T01:41:28.388781Z INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-18T01:41:28.388828Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-18T01:41:28.388869Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-05-18T01:41:28.658484Z INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `"./tokenizer.json"`
2024-05-18T01:41:29.024145Z INFO mistralrs_core::pipeline::gguf: Loading `"config.json"` locally at `"./config.json"`
2024-05-18T01:41:29.024256Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-05-18T01:41:29.333151Z INFO mistralrs_core::pipeline: Loading `"model.gguf"` locally at `"./model.gguf"`
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:290:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: tokio::runtime::context::runtime::enter_runtime
4: tokio::runtime::runtime::Runtime::block_on
5: mistralrs_server::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Note that --token-source none has no effect here (it must be given before the gguf subcommand, as it is not considered a valid option after it); the code path still goes through load_model_from_hf (which would then forward to load_model_from_path if it didn't panic):
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/gguf.rs#L303
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/mod.rs#L180-L187
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/mod.rs#L210
--token-source does accept none and refuses None or not-a-valid-variant, so I'm not sure why the output suggests it's still trying to use the default TokenSource::CacheToken? But it should be TokenSource::None (EDIT: Confirmed, this is a check the hf_hub crate does regardless)
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-server/src/main.rs#L94-L95
Initial attempt
Let's follow the problem from the CLI to the Hugging Face API call with the token:
EDIT: Collapsed for brevity (not relevant)
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-server/src/main.rs#L245
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-server/src/main.rs#L284-L296
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/gguf.rs#L278-L303
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/macros.rs#L133-L138
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/gguf.rs#L31
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/utils/tokens.rs#L15-L18
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/utils/tokens.rs#L53
The macro calls the Hugging Face API and adds the token via .with_token(Some(get_token($token_source)?)), which is an empty string for TokenSource::None. The upstream crate's .with_token() expects an Option, but the value is wrapped in Some() here, so a valid token is always assumed to have been provided.
https://docs.rs/hf-hub/latest/hf_hub/api/sync/struct.ApiBuilder.html#method.with_token
https://github.com/huggingface/hf-hub/blob/9d6502f5bc2e69061c132f523c76a76dad470477/src/api/sync.rs#L143-L157
/// Sets the token to be used in the API
pub fn with_token(mut self, token: Option<String>) -> Self {
    self.token = token;
    self
}

fn build_headers(&self) -> HeaderMap {
    let mut headers = HeaderMap::new();
    let user_agent = format!("unkown/None; {NAME}/{VERSION}; rust/unknown");
    headers.insert(USER_AGENT, user_agent);
    if let Some(token) = &self.token {
        headers.insert(AUTHORIZATION, format!("Bearer {token}"));
    }
    headers
}
Because an empty string was passed in, it passes that conditional and the Authorization HTTP header is added with an empty Bearer value. If None had instead been passed to the API here, the header would be skipped and that error would be avoided.
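A minimal sketch of that kind of fix (the helper name is hypothetical; .with_token(Option<String>) is the upstream signature quoted above): map an empty token to None before handing it over, so build_headers() never emits an empty Bearer header.

/// Hypothetical helper: treat an empty token string as "no token at all",
/// so that hf-hub's build_headers() skips the Authorization header entirely.
fn token_for_api(token: String) -> Option<String> {
    if token.is_empty() {
        None
    } else {
        Some(token)
    }
}

fn main() {
    // TokenSource::None currently yields "" -> no header should be sent at all.
    assert_eq!(token_for_api(String::new()), None);
    // A real token is forwarded unchanged.
    assert_eq!(token_for_api("hf_abc".to_string()), Some("hf_abc".to_string()));
    // The call site would then be:
    //   ApiBuilder::new().with_token(token_for_api(get_token(token_source)?))
}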
Next up, back in mistral.rs with that same macro, the expected tokenizer.json and config.json files are presumably being enforced by the logic here:
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/macros.rs#L147-L154
Workaround
I'm terrible at debugging, so I sprinkled a bunch of info! lines to track where the logic in the macro was failing:
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/macros.rs#L180-L191
The api_dir_list at this point fails due to the 401 response:
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/macros.rs#L2-L18
I'm not familiar with what this part of the code is trying to do, but for local/offline use the HF API shouldn't be queried at all... but it seems to be enforced?
Since 404 errors are already excepted from the panic, doing the same for 401 makes it happy (tokenizer_config.json must also be provided, though):
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/macros.rs#L15
# Example:
if resp.into_response().is_some_and(|r| !(matches!(r.status(), 401 | 404))) {
The proper solution is probably to opt out of the HF API entirely, though?
Hi @MoonRide303!
Our close integration with the HF hub is intentional, as generally it is better to use the official tokenizer. However, I agree that it would be nice to enable loading from only a GGUF file. I'll begin work on this, and it shouldn't be too hard.
@polarathene:
Note that --token-source none has no effect (it must be given prior to the gguf subcommand where it is not considered a valid option), the code path still goes through load_model_from_hf (which will then forward to load_model_from_path if it didn't panic):
I think this behavior can be improved, I'll make a modification.
Agree with https://github.com/EricLBuehler/mistral.rs/issues/326#issuecomment-2119166130. The prior PR was the minimal change needed to load a known HF model from local files. It is an awkward UX to have for a local-only model.
I think there is a strong use case for loading from file without access to hugging face. HF is good! But, if you're trying to use an LLM in production, it's another failure point if your access to HF goes down. Also, there is always the risk that the creators of the LLM model might deny access to the repo at some point in the future.
Anyways, trying to get this to work locally now with the rust library. load_model_from_path requires the ModelPaths object, which doesn't seem to be importable from mistralrs/src/lib.rs.
I think there is a strong use case for loading from file without access to hugging face. HF is good! But, if you're trying to use an LLM in production, it's another failure point if your access to HF goes down. Also, there is always the risk that the creators of the LLM model might deny access to the repo at some point in the future.
Yes, especially when using a GGUF file, as otherwise there is always ISQ. I'm working on adding this in #345.
Anyways, trying to get this to work locally now with the rust library. load_model_from_path requires the ModelPaths object, which doesn't seem to be importable from mistralrs/src/lib.rs.
Ah, sorry, that was an oversight. I just merged #348, which both exposes those, and also exposes the Device, DType and a few other useful types so that you do not need to explicitly depend on our Candle branch.
@MoonRide303, @polarathene, @Jeadie, @joshpopelka20, @ShelbyJenkins
I just merged #345, which enables using the GGUF tokenizer. The implementation is tested against the HF tokenizer in CI, so you have a guarantee that it is correct. This is the applicable readme section.
Here is an example:
cargo run --release --features ... -- -i --chat-template <chat_template> gguf -m . -f Phi-3-mini-128k-instruct-q4_K_M.gguf
I would appreciate your thoughts on how this can be improved!
@EricLBuehler Not strictly related to this issue, but I updated to the current CUDA version (12.5) a few days ago, and mistral.rs (as of v0.1.11) no longer compiles. Not blocking compilation with newer (and possibly backward-compatible) versions of CUDA would definitely be an improvement, allowing me to verify if / how the fix works ^^ (alternative: provide binary releases).
Compiling onig_sys v69.8.1
error: failed to run custom build command for `cudarc v0.11.1`
Caused by:
process didn't exit successfully: `D:\repos-git\mistral.rs\target\release\build\cudarc-2198e5ff31cf1aaa\build-script-build` (exit code: 101)
--- stdout
cargo:rerun-if-changed=build.rs
cargo:rerun-if-env-changed=CUDA_ROOT
cargo:rerun-if-env-changed=CUDA_PATH
cargo:rerun-if-env-changed=CUDA_TOOLKIT_ROOT_DIR
--- stderr
thread 'main' panicked at C:\Users\[REDACTED]\.cargo\registry\src\index.crates.io-6f17d22bba15001f\cudarc-0.11.1\build.rs:54:14:
Unsupported cuda toolkit version: `12050`. Please raise a github issue.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
@MoonRide303, yes, but unfortunately that's a problem higher up in the dependency graph. There's a PR for that here: coreylowman/cudarc#238, and I'll let you know when it gets merged.
Alternatively, could you try out one of our docker containers: https://github.com/EricLBuehler/mistral.rs/pkgs/container/mistral.rs
I am currently working on a project where I need to use a GGUF model locally. However, I am not very familiar with calling Rust libraries directly.
Could you please provide an example of how to invoke gguf locally in Rust? A simple example would be very helpful for my understanding.
Thank you for your assistance!
Could you please provide an example of how to invoke gguf locally in Rust? A simple example would be very helpful for my understanding.
Absolutely, here is a simple example of running a GGUF model purely locally:
https://github.com/EricLBuehler/mistral.rs/blob/9273f2a9157ae9f646d08a3f91e799548c700765/mistralrs/examples/gguf_locally/main.rs#L10-L64
Please feel free to let me know if you have any questions!
@EricLBuehler Thank you for providing the example!
I tried running it, but I encountered the following error:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
Since this is a local example, I assumed that the HuggingFace Token wouldn't be necessary. Is this not the case?
Hi @solaoi, that should be fixed now, can you please try it again after a git pull?
@MoonRide303
Not strictly related to this issue, but I updated to current (12.5) CUDA version few days ago, and mistral.rs (as of v0.1.11) no longer compiles.
cudarc just merged 12.5 support, so this should compile now.
@EricLBuehler Pros: it compiles. Cons: doesn't work.
mistralrs-server.exe gguf -m . -f Mistral-7B-Instruct-v0.3-Q6_K.gguf
thread 'main' panicked at C:\Users\[REDACTED]\.cargo\registry\src\index.crates.io-6f17d22bba15001f\cudarc-0.11.2\src\driver\sys\mod.rs:43:71:
called `Result::unwrap()` on an `Err` value: LoadLibraryExW { source: Os { code: 126, kind: Uncategorized, message: "Nie można odnaleźć określonego modułu." } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Not sure what might be causing this (OS error 126 means "The specified module could not be found") - but llama.cpp compiles and works without issues, so I'd assume my env is fine.
@MoonRide303, it looks like you are using Windows. This issue has been reported here (coreylowman/cudarc#219) and here (huggingface/candle#2175). Can you add the path to your libcuda.so to LD_LIBRARY_PATH?
@EricLBuehler .so ELFs and LD_LIBRARY_PATH won't work on Windows. I am compiling and using dynamically linked CUDA-accelerated llama.cpp builds without issues, so CUDA .dlls should be in my path already.
$ which nvcuda.dll
/c/Windows/system32/nvcuda.dll
$ which nvrtc64_120_0.dll
/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.5/bin/nvrtc64_120_0.dll
$ which cudart64_12.dll
/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.5/bin/cudart64_12.dll
Right, sorry, my mistake. On Windows, do you know if you have multiple CUDA installations? Can you run:
$ dir /c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/
@EricLBuehler Got some, but those were just empty dirs from old versions:
$ ls "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/"
v11.7/ v11.8/ v12.1/ v12.4/ v12.5/
I removed all except 12.5, and it didn't help in any way. But it shouldn't matter as long as the necessary .dlls are in the path (and they are).
that should be fixed now, can you please try it again after a git pull?
I did a fresh git clone and build in a container, ~~so I'm not sure why I'm still encountering the 401.~~
@EricLBuehler I assume the reason you're not experiencing that is that the default for --token-source is cache, which you've probably been using? You need to add --token-source none and you should hit the same problem?
I pointed out the 401 issue earlier. It can be bypassed with a patch, but the proper solution would be to skip calling out to HF in the first place?
The loader's HF method isn't doing much beyond getting the paths and then implicitly calling the local method with that extra data?:
https://github.com/EricLBuehler/mistral.rs/blob/527e7f5282c991d399110e21ddbef6c51bba607c/mistralrs-core/src/pipeline/gguf.rs#L282-L291
What is the actual minimum set of paths needed? Can that whole method be skipped if paths are provided locally? Is the chat template required, or can it fall back to a default (perhaps with a warning)? I'm not sure what llama.cpp does, but its local GGUF loader support doesn't require much upfront to run.
Otherwise this macro is presumably juggling conditions of API call vs fallback/alternative? (also I'm not too fond of duplicating the macro to adjust for local GGUF):
https://github.com/EricLBuehler/mistral.rs/blob/527e7f5282c991d399110e21ddbef6c51bba607c/mistralrs-core/src/pipeline/macros.rs#L156-L244
Ah I see the paths struct here:
https://github.com/EricLBuehler/mistral.rs/blob/527e7f5282c991d399110e21ddbef6c51bba607c/mistralrs-core/src/pipeline/mod.rs#L98-L111
How about this?:
- Some condition to opt-out of HF API when providing local file paths?
- Based on that condition handle either:
  - Assign any user-supplied paths from the CLI
  - Get all path info via the HF API
- Apply fallback paths for any mandatory paths that are still None, or fail. load_model_from_path() can be called now. load_model_from_hf() would be changed to only return the paths, instead of the minor convenience of calling load_model_from_path() internally?
I am a bit more familiar with this area of the project now, I might be able to take a shot at it once my active PR is merged 😅
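To sketch the idea concretely (all names here are illustrative, not the actual mistral.rs types): resolve whatever the user supplied first, only touch the HF API for pieces that are still missing and only when not opted out, then fail or apply defaults for anything mandatory left unset.

use std::path::PathBuf;

// Illustrative only; not the real `ModelPaths` type.
#[derive(Debug)]
struct ResolvedPaths {
    model: PathBuf,
    tokenizer: Option<PathBuf>,
    chat_template: Option<PathBuf>,
}

fn resolve_paths(
    cli_model: PathBuf,
    cli_tokenizer: Option<PathBuf>,
    cli_chat_template: Option<PathBuf>,
    allow_hf_api: bool,
) -> Result<ResolvedPaths, String> {
    // 1. Start from whatever the user supplied on the CLI.
    let mut paths = ResolvedPaths {
        model: cli_model,
        tokenizer: cli_tokenizer,
        chat_template: cli_chat_template,
    };
    // 2. Only query the HF API for pieces that are still missing, and only
    //    if the user hasn't opted out (e.g. via a `--local` flag).
    if paths.tokenizer.is_none() && allow_hf_api {
        // The network lookup would go here; omitted in this sketch.
        paths.tokenizer = None;
    }
    // 3. Fail (or warn and apply a default) for anything mandatory left unset.
    if !paths.model.exists() {
        return Err(format!("model file {:?} not found", paths.model));
    }
    // load_model_from_path() could be called with `paths` from here.
    Ok(paths)
}

fn main() {
    // Fully local: never touch the HF API.
    let result = resolve_paths(PathBuf::from("model.gguf"), None, None, false);
    println!("{result:?}");
}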
Original response
Perhaps I am not using the command correctly:
Attempts
You can ignore most of this; mistralrs-server gguf -m . -f model.gguf fails with 401 Unauthorized.
From the mistral.rs git repo at /mist, absolute path to model location for -m (EDIT: this probably should have been part of -f):
$ RUST_BACKTRACE=1 target/release/mistralrs-server gguf -m /models/Hermes-2-Pro-Mistral-7B.Q4_K_M -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
2024-05-30T00:06:47.470010Z INFO mistralrs_core::pipeline::gguf: Loading model `/models/Hermes-2-Pro-Mistral-7B.Q4_K_M` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-30T00:06:47.508433Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
thread 'main' panicked at mistralrs-core/src/pipeline/gguf_tokenizer.rs:65:31:
no entry found for key
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::panicking::panic_display
3: core::option::expect_failed
4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
6: tokio::runtime::context::runtime::enter_runtime
7: mistralrs_server::main
Error "no entry found for key"
From the model directory, absolute path to mistralrs-server:
$ RUST_BACKTRACE=1 /mist/target/release/mistralrs-server gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: tokio::runtime::context::runtime::enter_runtime
4: mistralrs_server::main
401 unauthorized.
Just to double check I copied the mistralrs-server executable to the same folder which is how I previously tried to run it in past comments:
$ RUST_BACKTRACE=1 ./server gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: tokio::runtime::context::runtime::enter_runtime
4: mistralrs_server::main
401 unauthorized again.
- Adding --token-source none doesn't help. Nor -t .; both were the two other args I had originally used in past comments but AFAIK aren't necessary anymore? Produces same error as above with 401.
- If I don't use -t ., but use -m with an absolute path to the directory for the -f, it'll give the first output I shared in this comment. So I figured maybe the same for the -t, which then results in:
$ RUST_BACKTRACE=1 ./server --token-source none gguf -m /models/Hermes-2-Pro-Mistral-7B.Q4_K_M -t /models/Hermes-2-Pro-Mistral-7B.Q4_K_M -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(404, Response[status: 404, status_text: Not Found, url: https://huggingface.co//models/Hermes-2-Pro-Mistral-7B.Q4_K_M/resolve/main/tokenizer.json]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf::{{closure}}
3: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
4: tokio::runtime::context::runtime::enter_runtime
5: mistralrs_server::main
So it's still trying to connect to HF 🤷♂️ (because of the mandatory -m arg I guess when I don't use .)
This model was one that you mentioned had a duplicate field (that error isn't being encountered here, although previously I had to add a patch to bypass a 401 panic, which you can see above).
@MoonRide303, @polarathene, the following command works on my machine after I merged #362:
cargo run --release --features cuda -- -i --token-source none --chat-template chat_templates/mistral.json gguf -m . -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
Note: as documented in the README here, you need to specify the model ID, file, and chat template when loading a local GGUF model without using the HF tokenizer. If you are using the HF tokenizer, you may specify -t/--tok-model-id, which is an HF/local model ID pointing to the tokenizer.json and tokenizer_config.json.
The loader's HF method isn't doing much beyond getting the paths and then implicitly calling the local method with that extra data?:
Yes, it just queries the HTTP side and if that fails treats them as local paths. My thinking was that we should always try HTTP first, but maybe you can flip that in a future PR?
Otherwise this macro is presumably juggling conditions of API call vs fallback/alternative?
Not really, the api_dir_list! and api_get_file! macros handle that. get_paths_gguf! just handles the tokenizer loading differences between GGUF and anything else. I don't love it though, maybe we can use akin or something like that to deduplicate it? I haven't looked into that area.
Some condition to opt-out of HF API when providing local file paths?
That seems like a great idea, perhaps --local in the CLI and a flag in the MistralRs builder so that the Python and Rust APIs can accept it? Happy to accept a PR for that too.
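For instance, a minimal sketch of what the CLI side could look like with clap's derive API (struct and field names here are hypothetical, not the current mistral.rs CLI):

use clap::Parser;

/// Hypothetical top-level CLI args; only the suggested flag is shown here.
#[derive(Parser, Debug)]
struct Args {
    /// Never contact the Hugging Face Hub; resolve every file path locally.
    #[arg(long)]
    local: bool,
}

fn main() {
    let args = Args::parse();
    if args.local {
        println!("local mode: skipping all HF Hub requests");
    } else {
        println!("default mode: HF Hub may be queried for missing files");
    }
}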
@polarathene, the following command works on my machine after I merged #362:
cargo run --release --features cuda -- -i --token-source none --chat-template chat_templates/mistral.json gguf -m . -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
Just realized you were referencing a change from the past hour; I built again and your example works properly now 🎉
Original response
$ RUST_BACKTRACE=1 ./server -i --token-source none --chat-template /mist/chat_templates/mistral.json gguf -m . -f /models/mistral-7b-instruct-v0.1.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: tokio::runtime::context::runtime::enter_runtime
4: mistralrs_server::main
Is that the same model as you used? I saw you link to it the other day (might be worth having the link next to the README example if troubleshooting with a common model is advised?).
$ ./server --version
mistralrs-server 0.1.11
$ git log
commit 527e7f5282c991d399110e21ddbef6c51bba607c (grafted, HEAD -> master, origin/master, origin/HEAD)
Author: Eric Buehler <[email protected]>
Date: Wed May 29 10:12:24 2024 -0400
Merge pull request #360 from EricLBuehler/fix_unauth
Fix no auth token for local loading
Oh... I mistook the PR you referenced for an older one; I see that's new.
However, the same command with just -f changed to another Mistral model had this "no entry found for key" failure, which I mentioned in my previous message:
$ RUST_BACKTRACE=1 target/release/mistralrs-server -i --token-source none --chat-template /mist/chat_templates/mistral.json gguf -m . -f /models/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
2024-05-30T11:06:44.051393Z INFO mistralrs_core::pipeline::gguf: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-30T11:06:44.099117Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
thread 'main' panicked at mistralrs-core/src/pipeline/gguf_tokenizer.rs:65:31:
no entry found for key
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::panicking::panic_display
3: core::option::expect_failed
4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
6: tokio::runtime::context::runtime::enter_runtime
7: mistralrs_server::main
I tried another model and it was loading, then panicked because I mistyped a different chat template filename; it probably should have verified the file existed before attempting to load the model.
Tried a few other GGUF models from HF and some also failed with no entry found for key, yet these seem to work with llama-cpp so probably not quite there yet? 🤔 (Here's the Hermes model I tried that gave the above failure)
@polarathene I think you should be able to run the Hermes model now. I just merged #363, which allows the default unigram UNK token (0) in case it is missing.
Tried a few other GGUF models from HF and some also failed with no entry found for key, yet these seem to work with llama-cpp so probably not quite there yet? 🤔 (Here's the Hermes model I tried that gave the above failure)
Yeah, we only support the llama/replit GGUF tokenizer model for now as they are both unigram. After I merge #356, I'll add support for chat template via the GGUF file (I don't want to cause any more rebases :)), but until then, you should provide the Hermes chat template in the chat template file. In this case, it would be:
{
"chat_template": "{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
}
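For example, assuming you save that JSON as hermes.json next to the model (the filename and paths here are illustrative):

./mistralrs-server -i --token-source none --chat-template hermes.json gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf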
@EricLBuehler Got some, but those were just empty dirs from old versions:
$ ls "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/" v11.7/ v11.8/ v12.1/ v12.4/ v12.5/I removed all except 12.5, and it didn't help in any way. But it shouldn't matter as long as necessary .dll are in the path (and they are).
@MoonRide303, coreylowman/cudarc#240 should fix this.
@MoonRide303, I think it should be fixed now, https://github.com/coreylowman/cudarc/pull/240 was merged and there are reports that it works for others. Can you please try it again after a git pull and cargo update?
Yeah, we only support the llama/replit GGUF tokenizer model for now as they are both unigram.
I don't know much about these, but after a git pull I can confirm Hermes is working now, while the models that aren't working report an error about the actual tokenizer lacking support, which is good to see 👍
I don't know much about these, but after a git pull I can confirm Hermes is working now, while the models that aren't working report an error about the actual tokenizer lacking support, which is good to see 👍
Great! For reference, see these docs: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#ggml
@EricLBuehler tested (full rebuild) on current master (1d21c5f2d8a75545135741d615fbd7c41106d5d7), result:
mistralrs-server.exe gguf -m . -f Mistral-7B-Instruct-v0.3-Q6_K.gguf
thread 'main' panicked at C:\Users\[REDACTED]\.cargo\registry\src\index.crates.io-6f17d22bba15001f\cudarc-0.11.3\src\curand\sys\mod.rs:51:9:
Unable to find curand lib under the names ["curand", "curand64", "curand64_12", "curand64_125", "curand64_125_0", "curand64_120_5"]. Please open GitHub issue.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
The proper name for this one (as of CUDA 12.5 on Windows) should be:
$ which curand64_10.dll
/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.5/bin/curand64_10.dll
@MoonRide303 I opened an issue: coreylowman/cudarc#242. I'll let you know when a fix gets merged.