Running a model from a GGUF file, only
Describe the bug
Running a model from a GGUF file using llama.cpp is very straightforward, like this:
server -v -ngl 99 -m Phi-3-mini-4k-instruct-Q6_K.gguf
and if the model is supported, it just works.
I tried to do the same using mistral.rs, and I got this:
mistralrs-server gguf -m . -t . -f .\Phi-3-mini-4k-instruct-Q6_K.gguf
2024-05-17T07:21:48.581660Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-17T07:21:48.581743Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-17T07:21:48.581820Z INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-17T07:21:48.581873Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-17T07:21:48.583625Z INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:21:48.583707Z INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:290:58:
File "tokenizer.json" not found at model id "."
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Why does it ask me for a tokenizer file, when it's included in the GGUF file? I understand having this as an option (if I wanted to try out a different tokenizer / configuration), but by default it should just use the information provided in the GGUF file itself.
Next attempt, after copying tokenizer.json from the original model repo:
mistralrs-server gguf -m . -t . -f .\Phi-3-mini-4k-instruct-Q6_K.gguf
2024-05-17T07:28:34.987235Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-17T07:28:34.987332Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-17T07:28:34.987382Z INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-17T07:28:34.987431Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-17T07:28:34.989190Z INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:28:34.989270Z INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
2024-05-17T07:28:35.532371Z INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `".\\tokenizer.json"`
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:290:58:
File "config.json" not found at model id "."
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
And another attempt, after copying config.json (which I think is also unnecessary, as llama.cpp works fine without it):
mistralrs-server gguf -m . -t . -f .\Phi-3-mini-4k-instruct-Q6_K.gguf
2024-05-17T07:30:00.352139Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-17T07:30:00.352236Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-17T07:30:00.352301Z INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-17T07:30:00.352344Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-17T07:30:00.354085Z INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:30:00.354168Z INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
2024-05-17T07:30:00.601055Z INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `".\\tokenizer.json"`
2024-05-17T07:30:00.814258Z INFO mistralrs_core::pipeline::gguf: Loading `"config.json"` locally at `".\\config.json"`
2024-05-17T07:30:00.814412Z INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:30:00.814505Z INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
2024-05-17T07:30:01.022055Z INFO mistralrs_core::pipeline: Loading `".\\Phi-3-mini-4k-instruct-Q6_K.gguf"` locally at `".\\.\\Phi-3-mini-4k-instruct-Q6_K.gguf"`
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:290:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I wanted to give mistral.rs a shot, but it's a really painful experience for now.
Latest commit ca9bf7d1a8a67bd69a3eed89841a106d2e518c45 (v0.1.8)
Did you add the Hugging Face token? I got the same error RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main])) until I added the token.
Here are the ways you can add it: https://github.com/EricLBuehler/mistral.rs?tab=readme-ov-file#getting-models-from-hf-hub.
Also, this thread helped me as I was getting a 403 error after that: https://discuss.huggingface.co/t/error-403-what-to-do-about-it/12983. I had to accept the Llama license.
@joshpopelka20 I want to run a model from a local GGUF file only - exactly the same way as in llama.cpp. Communication with HF (or any other) servers shouldn't ever be required for that.
A recent issue also showed UX issues with this: https://github.com/EricLBuehler/mistral.rs/issues/295#issuecomment-2106450931
UPDATE: This local model support may have been a very new feature it seems, which might explain the current UX issues: https://github.com/EricLBuehler/mistral.rs/pull/308
I found the README a bit confusing too compared to llama.cpp for local GGUF (it doesn't help that it refers to terms you need to configure but then uses short option names, and the linked CLI args output also appears outdated compared to what a git build shows).
I was not able to use absolute or relative paths in a way that mistral.rs would understand / accept, so based on the linked issue above, I had to ensure the binary was adjacent to the model (and the forced tokenizer.json + config.json files)...
It still fails like yours did, but here is the extra output showing why:
$ RUST_BACKTRACE=1 ./mistralrs-server --token-source none gguf -m . -t . -f model.gguf
2024-05-18T01:41:28.388727Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-18T01:41:28.388775Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-18T01:41:28.388781Z INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-18T01:41:28.388828Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-18T01:41:28.388869Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-05-18T01:41:28.658484Z INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `"./tokenizer.json"`
2024-05-18T01:41:29.024145Z INFO mistralrs_core::pipeline::gguf: Loading `"config.json"` locally at `"./config.json"`
2024-05-18T01:41:29.024256Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-05-18T01:41:29.333151Z INFO mistralrs_core::pipeline: Loading `"model.gguf"` locally at `"./model.gguf"`
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:290:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: tokio::runtime::context::runtime::enter_runtime
4: tokio::runtime::runtime::Runtime::block_on
5: mistralrs_server::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Note that --token-source none has no effect here (it must be given before the gguf subcommand, as it is not considered a valid option after it); the code path still goes through load_model_from_hf (which would then forward to load_model_from_path if it didn't panic):
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/gguf.rs#L303
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/mod.rs#L180-L187
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/mod.rs#L210
--token-source does accept none and refuses None or not-a-valid-variant, so I'm not sure why the output suggests it's still trying to use the default TokenSource::CacheToken? But it should be TokenSource::None (EDIT: Confirmed, this is a check the hf_hub crate does regardless)
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-server/src/main.rs#L94-L95
Initial attempt
Let's follow the problem from the CLI to the Hugging Face API call with the token:
EDIT: Collapsed for brevity (not relevant)
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-server/src/main.rs#L245
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-server/src/main.rs#L284-L296
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/gguf.rs#L278-L303
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/macros.rs#L133-L138
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/gguf.rs#L31
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/utils/tokens.rs#L15-L18
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/utils/tokens.rs#L53
The macro calls the Hugging Face API and adds the token via .with_token(Some(get_token($token_source)?)), which is an empty string for TokenSource::None. The upstream crate's .with_token() expects an Option, but the value is wrapped in Some() here, so a valid token is always assumed to have been provided.
https://docs.rs/hf-hub/latest/hf_hub/api/sync/struct.ApiBuilder.html#method.with_token
https://github.com/huggingface/hf-hub/blob/9d6502f5bc2e69061c132f523c76a76dad470477/src/api/sync.rs#L143-L157
/// Sets the token to be used in the API
pub fn with_token(mut self, token: Option<String>) -> Self {
    self.token = token;
    self
}

fn build_headers(&self) -> HeaderMap {
    let mut headers = HeaderMap::new();
    let user_agent = format!("unkown/None; {NAME}/{VERSION}; rust/unknown");
    headers.insert(USER_AGENT, user_agent);
    if let Some(token) = &self.token {
        headers.insert(AUTHORIZATION, format!("Bearer {token}"));
    }
    headers
}
Because an empty string was passed in, it passes that conditional and the Authorization HTTP header is added with an empty Bearer value. If None had instead been passed to the API here, the header would be skipped and that error would be avoided.
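A minimal sketch of that kind of fix (the helper name is hypothetical; .with_token(Option<String>) is the upstream signature quoted above): map an empty token to None before handing it over, so build_headers() never emits an empty Bearer header.

/// Hypothetical helper: treat an empty token string as "no token at all",
/// so that hf-hub's build_headers() skips the Authorization header entirely.
fn token_for_api(token: String) -> Option<String> {
    if token.is_empty() {
        None
    } else {
        Some(token)
    }
}

fn main() {
    // TokenSource::None currently yields "" -> no header should be sent at all.
    assert_eq!(token_for_api(String::new()), None);
    // A real token is forwarded unchanged.
    assert_eq!(token_for_api("hf_abc".to_string()), Some("hf_abc".to_string()));
    // The call site would then be:
    //   ApiBuilder::new().with_token(token_for_api(get_token(token_source)?))
}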
Next up, back in mistral.rs with that same macro, the expected tokenizer.json and config.json files are presumably being enforced by the logic here:
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/macros.rs#L147-L154
Workaround
I'm terrible at debugging, so I sprinkled a bunch of info! lines to track where the logic in the macro was failing:
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/macros.rs#L180-L191
The api_dir_list at this point fails due to the 401 response:
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/macros.rs#L2-L18
I'm not familiar with what this part of the code is trying to do, but for local/offline use the HF API shouldn't be queried at all... but it seems to be enforced?
Since 404 errors are already excepted from the panic, doing the same for 401 makes it happy (tokenizer_config.json must also be provided, though):
https://github.com/EricLBuehler/mistral.rs/blob/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45/mistralrs-core/src/pipeline/macros.rs#L15
# Example:
if resp.into_response().is_some_and(|r| !(matches!(r.status(), 401 | 404))) {
The proper solution is probably to opt out of the HF API entirely, though?
Hi @MoonRide303!
Our close integration with the HF hub is intentional, as generally it is better to use the official tokenizer. However, I agree that it would be nice to enable loading from only a GGUF file. I'll begin work on this, and it shouldn't be too hard.
@polarathene:
Note that --token-source none has no effect (it must be given prior to the gguf subcommand where it is not considered a valid option), the code path still goes through load_model_from_hf (which will then forward to load_model_from_path if it didn't panic):
I think this behavior can be improved, I'll make a modification.
Agree with https://github.com/EricLBuehler/mistral.rs/issues/326#issuecomment-2119166130. The prior PR was the minimal change needed to load a known HF model from local files. It is an awkward UX to have for a local-only model.
I think there is a strong use case for loading from file without access to hugging face. HF is good! But, if you're trying to use an LLM in production, it's another failure point if your access to HF goes down. Also, there is always the risk that the creators of the LLM model might deny access to the repo at some point in the future.
Anyways, trying to get this to work locally now with the rust library. load_model_from_path requires the ModelPaths object, which doesn't seem to be importable from mistralrs/src/lib.rs.
I think there is a strong use case for loading from file without access to hugging face. HF is good! But, if you're trying to use an LLM in production, it's another failure point if your access to HF goes down. Also, there is always the risk that the creators of the LLM model might deny access to the repo at some point in the future.
Yes, especially when using a GGUF file, as otherwise there is always ISQ. I'm working on adding this in #345.
Anyways, trying to get this to work locally now with the rust library. load_model_from_path requires the ModelPaths object, which doesn't seem to be importable from mistralrs/src/lib.rs.
Ah, sorry, that was an oversight. I just merged #348, which both exposes those, and also exposes the Device, DType and a few other useful types so that you do not need to explicitly depend on our Candle branch.
@MoonRide303, @polarathene, @Jeadie, @joshpopelka20, @ShelbyJenkins
I just merged #345, which enables using the GGUF tokenizer. The implementation is tested against the HF tokenizer in CI, so you have a guarantee that it is correct. This is the applicable readme section.
Here is an example:
cargo run --release --features ... -- -i --chat-template <chat_template> gguf -m . -f Phi-3-mini-128k-instruct-q4_K_M.gguf
I would appreciate your thoughts on how this can be improved!
@EricLBuehler Not strictly related to this issue, but I updated to the current CUDA version (12.5) a few days ago, and mistral.rs (as of v0.1.11) no longer compiles. Not blocking compilation with newer (and possibly backward-compatible) versions of CUDA would definitely be an improvement, allowing me to verify if / how the fix works ^^ (alternative: provide binary releases).
Compiling onig_sys v69.8.1
error: failed to run custom build command for `cudarc v0.11.1`
Caused by:
process didn't exit successfully: `D:\repos-git\mistral.rs\target\release\build\cudarc-2198e5ff31cf1aaa\build-script-build` (exit code: 101)
--- stdout
cargo:rerun-if-changed=build.rs
cargo:rerun-if-env-changed=CUDA_ROOT
cargo:rerun-if-env-changed=CUDA_PATH
cargo:rerun-if-env-changed=CUDA_TOOLKIT_ROOT_DIR
--- stderr
thread 'main' panicked at C:\Users\[REDACTED]\.cargo\registry\src\index.crates.io-6f17d22bba15001f\cudarc-0.11.1\build.rs:54:14:
Unsupported cuda toolkit version: `12050`. Please raise a github issue.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
@MoonRide303, yes, but unfortunately that's a problem higher up in the dependency graph. There's a PR for that here: coreylowman/cudarc#238, and I'll let you know when it gets merged.
Alternatively, could you try out one of our docker containers: https://github.com/EricLBuehler/mistral.rs/pkgs/container/mistral.rs
I am currently working on a project where I need to use a GGUF model locally. However, I am not very familiar with calling Rust libraries directly.
Could you please provide an example of how to invoke gguf locally in Rust? A simple example would be very helpful for my understanding.
Thank you for your assistance!
Could you please provide an example of how to invoke gguf locally in Rust? A simple example would be very helpful for my understanding.
Absolutely, here is a simple example of running a GGUF model purely locally:
https://github.com/EricLBuehler/mistral.rs/blob/9273f2a9157ae9f646d08a3f91e799548c700765/mistralrs/examples/gguf_locally/main.rs#L10-L64
Please feel free to let me know if you have any questions!
@EricLBuehler Thank you for providing the example!
I tried running it, but I encountered the following error:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
Since this is a local example, I assumed that the HuggingFace Token wouldn't be necessary. Is this not the case?
Hi @solaoi, that should be fixed now, can you please try it again after a git pull?
@MoonRide303
Not strictly related to this issue, but I updated to current (12.5) CUDA version few days ago, and mistral.rs (as of v0.1.11) no longer compiles.
cudarc just merged 12.5 support, so this should compile now.
@EricLBuehler Pros: it compiles. Cons: doesn't work.
mistralrs-server.exe gguf -m . -f Mistral-7B-Instruct-v0.3-Q6_K.gguf
thread 'main' panicked at C:\Users\[REDACTED]\.cargo\registry\src\index.crates.io-6f17d22bba15001f\cudarc-0.11.2\src\driver\sys\mod.rs:43:71:
called `Result::unwrap()` on an `Err` value: LoadLibraryExW { source: Os { code: 126, kind: Uncategorized, message: "Nie można odnaleźć określonego modułu." } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Not sure what might be causing this (OS error 126 means "The specified module could not be found") - but llama.cpp compiles and works without issues, so I'd assume my env is fine.
@MoonRide303, it looks like you are using Windows. This issue has been reported here (coreylowman/cudarc#219) and here (huggingface/candle#2175). Can you add the path to your libcuda.so to LD_LIBRARY_PATH?
@EricLBuehler .so ELFs and LD_LIBRARY_PATH won't work on Windows. I am compiling and using dynamically linked CUDA-accelerated llama.cpp builds without issues, so CUDA .dlls should be in my path already.
$ which nvcuda.dll
/c/Windows/system32/nvcuda.dll
$ which nvrtc64_120_0.dll
/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.5/bin/nvrtc64_120_0.dll
$ which cudart64_12.dll
/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.5/bin/cudart64_12.dll
Right, sorry, my mistake. On Windows, do you know if you have multiple CUDA installations? Can you run:
$ dir /c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/
@EricLBuehler Got some, but those were just empty dirs from old versions:
$ ls "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/"
v11.7/ v11.8/ v12.1/ v12.4/ v12.5/
I removed all except 12.5, and it didn't help in any way. But it shouldn't matter as long as the necessary .dlls are in the path (and they are).
that should be fixed now, can you please try it again after a git pull?
I did a fresh git clone and build in a container, ~~so I'm not sure why I'm still encountering the 401.~~
@EricLBuehler I assume the reason you're not experiencing that is that the default for --token-source is cache, which you've probably been using? You need to add --token-source none and you should hit the same problem?
I pointed out the 401 issue earlier. It can be bypassed with a patch, but the proper solution would be to skip calling out to HF in the first place?
The loader's HF method isn't doing much beyond getting the paths and then implicitly calling the local method with that extra data?:
https://github.com/EricLBuehler/mistral.rs/blob/527e7f5282c991d399110e21ddbef6c51bba607c/mistralrs-core/src/pipeline/gguf.rs#L282-L291
What is the actual minimum set of paths needed? Can that whole method be skipped if paths are provided locally? Is the chat template required, or can it fall back to a default (perhaps with a warning)? I'm not sure what llama.cpp does, but its local GGUF loader support doesn't require much upfront to run.
Otherwise this macro is presumably juggling conditions of API call vs fallback/alternative? (also I'm not too fond of duplicating the macro to adjust for local GGUF):
https://github.com/EricLBuehler/mistral.rs/blob/527e7f5282c991d399110e21ddbef6c51bba607c/mistralrs-core/src/pipeline/macros.rs#L156-L244
Ah I see the paths struct here:
https://github.com/EricLBuehler/mistral.rs/blob/527e7f5282c991d399110e21ddbef6c51bba607c/mistralrs-core/src/pipeline/mod.rs#L98-L111
How about this?:
- Some condition to opt-out of HF API when providing local file paths?
- Based on that condition handle either:
  - Assign any user-supplied paths from the CLI
  - Get all path info via the HF API
- Apply fallback paths for any mandatory paths that are still None, or fail. load_model_from_path() can be called now. load_model_from_hf() would be changed to only return the paths, instead of the minor convenience of calling load_model_from_path() internally?
I am a bit more familiar with this area of the project now, I might be able to take a shot at it once my active PR is merged 😅
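To sketch the idea concretely (all names here are illustrative, not the actual mistral.rs types): resolve whatever the user supplied first, only touch the HF API for pieces that are still missing and only when not opted out, then fail or apply defaults for anything mandatory left unset.

use std::path::PathBuf;

// Illustrative only; not the real `ModelPaths` type.
#[derive(Debug)]
struct ResolvedPaths {
    model: PathBuf,
    tokenizer: Option<PathBuf>,
    chat_template: Option<PathBuf>,
}

fn resolve_paths(
    cli_model: PathBuf,
    cli_tokenizer: Option<PathBuf>,
    cli_chat_template: Option<PathBuf>,
    allow_hf_api: bool,
) -> Result<ResolvedPaths, String> {
    // 1. Start from whatever the user supplied on the CLI.
    let mut paths = ResolvedPaths {
        model: cli_model,
        tokenizer: cli_tokenizer,
        chat_template: cli_chat_template,
    };
    // 2. Only query the HF API for pieces that are still missing, and only
    //    if the user hasn't opted out (e.g. via a `--local` flag).
    if paths.tokenizer.is_none() && allow_hf_api {
        // The network lookup would go here; omitted in this sketch.
        paths.tokenizer = None;
    }
    // 3. Fail (or warn and apply a default) for anything mandatory left unset.
    if !paths.model.exists() {
        return Err(format!("model file {:?} not found", paths.model));
    }
    // load_model_from_path() could be called with `paths` from here.
    Ok(paths)
}

fn main() {
    // Fully local: never touch the HF API.
    let result = resolve_paths(PathBuf::from("model.gguf"), None, None, false);
    println!("{result:?}");
}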
Original response
Perhaps I am not using the command correctly:
Attempts
You can ignore most of this; mistralrs-server gguf -m . -f model.gguf fails with 401 Unauthorized.
From the mistral.rs git repo at /mist, absolute path to model location for -m (EDIT: this probably should have been part of -f):
$ RUST_BACKTRACE=1 target/release/mistralrs-server gguf -m /models/Hermes-2-Pro-Mistral-7B.Q4_K_M -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
2024-05-30T00:06:47.470010Z INFO mistralrs_core::pipeline::gguf: Loading model `/models/Hermes-2-Pro-Mistral-7B.Q4_K_M` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-30T00:06:47.508433Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
thread 'main' panicked at mistralrs-core/src/pipeline/gguf_tokenizer.rs:65:31:
no entry found for key
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::panicking::panic_display
3: core::option::expect_failed
4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
6: tokio::runtime::context::runtime::enter_runtime
7: mistralrs_server::main
Error "no entry found for key"
From the model directory, absolute path to mistralrs-server:
$ RUST_BACKTRACE=1 /mist/target/release/mistralrs-server gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: tokio::runtime::context::runtime::enter_runtime
4: mistralrs_server::main
401 unauthorized.
Just to double check I copied the mistralrs-server executable to the same folder which is how I previously tried to run it in past comments:
$ RUST_BACKTRACE=1 ./server gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: tokio::runtime::context::runtime::enter_runtime
4: mistralrs_server::main
401 unauthorized again.
- Adding --token-source none doesn't help. Nor -t .; both were the two other args I had originally used in past comments but AFAIK aren't necessary anymore? Produces same error as above with 401.
- If I don't use -t ., but use -m with an absolute path to the directory for the -f, it'll give the first output I shared in this comment. So I figured maybe the same for the -t, which then results in:
$ RUST_BACKTRACE=1 ./server --token-source none gguf -m /models/Hermes-2-Pro-Mistral-7B.Q4_K_M -t /models/Hermes-2-Pro-Mistral-7B.Q4_K_M -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(404, Response[status: 404, status_text: Not Found, url: https://huggingface.co//models/Hermes-2-Pro-Mistral-7B.Q4_K_M/resolve/main/tokenizer.json]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf::{{closure}}
3: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
4: tokio::runtime::context::runtime::enter_runtime
5: mistralrs_server::main
So it's still trying to connect to HF 🤷♂️ (because of the mandatory -m arg I guess when I don't use .)
This model was one that you mentioned had a duplicate field (that error isn't being encountered here, although previously I had to add a patch to bypass a 401 panic, which you can see above).
@MoonRide303, @polarathene, the following command works on my machine after I merged #362:
cargo run --release --features cuda -- -i --token-source none --chat-template chat_templates/mistral.json gguf -m . -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
Note: as documented in the README here, you need to specify the model ID, file, and chat template when loading a local GGUF model without using the HF tokenizer. If you are using the HF tokenizer, you may specify -t/--tok-model-id, which is an HF/local model ID pointing to the tokenizer.json and tokenizer_config.json.
The loader's HF method isn't doing much beyond getting the paths and then implicitly calling the local method with that extra data?:
Yes, it just queries the HTTP side and if that fails treats them as local paths. My thinking was that we should always try HTTP first, but maybe you can flip that in a future PR?
Otherwise this macro is presumably juggling conditions of API call vs fallback/alternative?
Not really, the api_dir_list! and api_get_file! macros handle that. get_paths_gguf! just handles the tokenizer loading differences between GGUF and anything else. I don't love it though, maybe we can use akin or something like that to deduplicate it? I haven't looked into that area.
Some condition to opt-out of HF API when providing local file paths?
That seems like a great idea, perhaps --local in the CLI and a flag in the MistralRs builder so that the Python and Rust APIs can accept it? Happy to accept a PR for that too.
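For instance, a minimal sketch of what the CLI side could look like with clap's derive API (struct and field names here are hypothetical, not the current mistral.rs CLI):

use clap::Parser;

/// Hypothetical top-level CLI args; only the suggested flag is shown here.
#[derive(Parser, Debug)]
struct Args {
    /// Never contact the Hugging Face Hub; resolve every file path locally.
    #[arg(long)]
    local: bool,
}

fn main() {
    let args = Args::parse();
    if args.local {
        println!("local mode: skipping all HF Hub requests");
    } else {
        println!("default mode: HF Hub may be queried for missing files");
    }
}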
@polarathene, the following command works on my machine after I merged #362:
cargo run --release --features cuda -- -i --token-source none --chat-template chat_templates/mistral.json gguf -m . -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
Just realized you were referencing a change from the past hour; I built again and your example works properly now 🎉
Original response
$ RUST_BACKTRACE=1 ./server -i --token-source none --chat-template /mist/chat_templates/mistral.json gguf -m . -f /models/mistral-7b-instruct-v0.1.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: tokio::runtime::context::runtime::enter_runtime
4: mistralrs_server::main
Is that the same model as you used? I saw you link to it the other day (might be worth having the link next to the README example if troubleshooting with a common model is advised?).
$ ./server --version
mistralrs-server 0.1.11
$ git log
commit 527e7f5282c991d399110e21ddbef6c51bba607c (grafted, HEAD -> master, origin/master, origin/HEAD)
Author: Eric Buehler <[email protected]>
Date: Wed May 29 10:12:24 2024 -0400
Merge pull request #360 from EricLBuehler/fix_unauth
Fix no auth token for local loading
Oh... I mistook the PR you referenced for an older one; I see that's new.
However, the same command with just -f changed to another Mistral model had this "no entry found for key" failure, which I mentioned in my previous message:
$ RUST_BACKTRACE=1 target/release/mistralrs-server -i --token-source none --chat-template /mist/chat_templates/mistral.json gguf -m . -f /models/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
2024-05-30T11:06:44.051393Z INFO mistralrs_core::pipeline::gguf: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-30T11:06:44.099117Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
thread 'main' panicked at mistralrs-core/src/pipeline/gguf_tokenizer.rs:65:31:
no entry found for key
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::panicking::panic_display
3: core::option::expect_failed
4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
6: tokio::runtime::context::runtime::enter_runtime
7: mistralrs_server::main
I tried another model and it was loading, then panicked because I mistyped a different chat template filename; it probably should have verified the file existed before attempting to load the model.
Tried a few other GGUF models from HF and some also failed with no entry found for key, yet these seem to work with llama-cpp so probably not quite there yet? 🤔 (Here's the Hermes model I tried that gave the above failure)
@polarathene I think you should be able to run the Hermes model now. I just merged #363, which allows the default unigram UNK token (0) in case it is missing.
Tried a few other GGUF models from HF and some also failed with no entry found for key, yet these seem to work with llama-cpp so probably not quite there yet? 🤔 (Here's the Hermes model I tried that gave the above failure)
Yeah, we only support the llama/replit GGUF tokenizer model for now as they are both unigram. After I merge #356, I'll add support for chat template via the GGUF file (I don't want to cause any more rebases :)), but until then, you should provide the Hermes chat template in the chat template file. In this case, it would be:
{
"chat_template": "{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
}
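For example, assuming you save that JSON as hermes.json next to the model (the filename and paths here are illustrative):

./mistralrs-server -i --token-source none --chat-template hermes.json gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf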
@EricLBuehler Got some, but those were just empty dirs from old versions:
$ ls "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/" v11.7/ v11.8/ v12.1/ v12.4/ v12.5/I removed all except 12.5, and it didn't help in any way. But it shouldn't matter as long as necessary .dll are in the path (and they are).
@MoonRide303, coreylowman/cudarc#240 should fix this.
@MoonRide303, I think it should be fixed now, https://github.com/coreylowman/cudarc/pull/240 was merged and there are reports that it works for others. Can you please try it again after a git pull and cargo update?
Yeah, we only support the llama/replit GGUF tokenizer model for now as they are both unigram.
I don't know much about these, but after a git pull I can confirm Hermes is working now, while the models that aren't working report an error about the actual tokenizer lacking support, which is good to see 👍
I don't know much about these, but after a git pull I can confirm Hermes is working now, while the models that aren't working report an error about the actual tokenizer lacking support, which is good to see 👍
Great! For reference, see these docs: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#ggml
@EricLBuehler tested (full rebuild) on current master (1d21c5f2d8a75545135741d615fbd7c41106d5d7), result:
mistralrs-server.exe gguf -m . -f Mistral-7B-Instruct-v0.3-Q6_K.gguf
thread 'main' panicked at C:\Users\[REDACTED]\.cargo\registry\src\index.crates.io-6f17d22bba15001f\cudarc-0.11.3\src\curand\sys\mod.rs:51:9:
Unable to find curand lib under the names ["curand", "curand64", "curand64_12", "curand64_125", "curand64_125_0", "curand64_120_5"]. Please open GitHub issue.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
The proper name for this one (as of CUDA 12.5 on Windows) should be:
$ which curand64_10.dll
/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.5/bin/curand64_10.dll
@MoonRide303 I opened an issue: coreylowman/cudarc#242. I'll let you know when a fix gets merged.