mistral.rs
Running model from a GGUF file, only
Describe the bug
Running a model from a GGUF file with llama.cpp is very straightforward, just like this:
server -v -ngl 99 -m Phi-3-mini-4k-instruct-Q6_K.gguf
and if the model is supported, it just works.
I tried to do the same with mistral.rs, and this is what I got:
mistralrs-server gguf -m . -t . -f .\Phi-3-mini-4k-instruct-Q6_K.gguf
2024-05-17T07:21:48.581660Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-17T07:21:48.581743Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-17T07:21:48.581820Z INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-17T07:21:48.581873Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-17T07:21:48.583625Z INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:21:48.583707Z INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:290:58:
File "tokenizer.json" not found at model id "."
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Why does it ask me for a tokenizer file when one is already included in the GGUF file? I understand having this as an option (if I wanted to try out a different tokenizer / configuration), but by default it should just use the information provided in the GGUF file itself.
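For reference, the tokenizer really does appear to live in the GGUF metadata. A minimal sketch that lists the tokenizer-related keys, assuming candle-core's `gguf_file` reader (the crate mistral.rs builds on) and the file name from this report:

```rust
// Minimal sketch: list the tokenizer-related metadata embedded in a GGUF file.
// Assumes the candle-core crate; the file name is the one used above.
use candle_core::quantized::gguf_file;
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut file = File::open("Phi-3-mini-4k-instruct-Q6_K.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;
    // GGUF keeps the tokenizer under `tokenizer.*` keys, e.g. tokenizer.ggml.model,
    // tokenizer.ggml.tokens and tokenizer.ggml.bos_token_id.
    for key in content.metadata.keys().filter(|k| k.starts_with("tokenizer.")) {
        println!("{key}");
    }
    Ok(())
}
```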
Next attempt, after copying tokenizer.json from the original model repo:
mistralrs-server gguf -m . -t . -f .\Phi-3-mini-4k-instruct-Q6_K.gguf
2024-05-17T07:28:34.987235Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-17T07:28:34.987332Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-17T07:28:34.987382Z INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-17T07:28:34.987431Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-17T07:28:34.989190Z INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:28:34.989270Z INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
2024-05-17T07:28:35.532371Z INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `".\\tokenizer.json"`
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:290:58:
File "config.json" not found at model id "."
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
And another attempt, after copying config.json (which I think is also unnecessary, as llama.cpp works fine without it; see the sketch after this log):
mistralrs-server gguf -m . -t . -f .\Phi-3-mini-4k-instruct-Q6_K.gguf
2024-05-17T07:30:00.352139Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-17T07:30:00.352236Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-17T07:30:00.352301Z INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-17T07:30:00.352344Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-17T07:30:00.354085Z INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:30:00.354168Z INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
2024-05-17T07:30:00.601055Z INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `".\\tokenizer.json"`
2024-05-17T07:30:00.814258Z INFO mistralrs_core::pipeline::gguf: Loading `"config.json"` locally at `".\\config.json"`
2024-05-17T07:30:00.814412Z INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:30:00.814505Z INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
2024-05-17T07:30:01.022055Z INFO mistralrs_core::pipeline: Loading `".\\Phi-3-mini-4k-instruct-Q6_K.gguf"` locally at `".\\.\\Phi-3-mini-4k-instruct-Q6_K.gguf"`
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:290:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
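On the config.json side: the hyperparameters that file would normally provide also look like they are stored in the GGUF metadata, under architecture-prefixed keys. A small sketch under the same assumptions as above (candle-core's `gguf_file` reader, file name from this report; "phi3" as the key prefix is my guess for this model's `general.architecture` value):

```rust
// Sketch: read a few config-style hyperparameters from the GGUF metadata.
// Assumes candle-core and that this model's architecture prefix is "phi3".
use candle_core::quantized::gguf_file;
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut file = File::open("Phi-3-mini-4k-instruct-Q6_K.gguf")?;
    let metadata = gguf_file::Content::read(&mut file)?.metadata;
    // These are the kinds of values a config.json would otherwise supply.
    for key in ["phi3.context_length", "phi3.embedding_length", "phi3.block_count"] {
        if let Some(value) = metadata.get(key) {
            println!("{key} = {}", value.to_u32()?);
        }
    }
    Ok(())
}
```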
I wanted to give mistral.rs a shot, but it's a really painful experience for now.
Latest commit ca9bf7d1a8a67bd69a3eed89841a106d2e518c45 (v0.1.8)