mistral.rs
bug: If device layers requested exceed model layers, host layers overflow
Describe the bug

If the number of requested device layers exceeds the model's layer count, the number of host layers to assign appears to wrap/overflow instead of being the expected 0.

NOTE: With llama-cpp you can configure a larger number of layers and the host layers will remain 0, while only the needed layers are used as device layers.
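The 18446744073709551615 in the log below is `usize::MAX`, which is what an unchecked `usize` subtraction produces when it wraps in a release build (32 - 33). A minimal sketch of the suspected calculation and a clamped fix; the names here are hypothetical, not the actual mistral.rs internals:

```rust
// Hypothetical sketch, not the actual mistral.rs code: how an unchecked
// `usize` subtraction wraps, and how a saturating version matches the
// llama-cpp behavior described above.
fn host_layers(model_layers: usize, device_layers: usize) -> usize {
    // Buggy form: `model_layers - device_layers` wraps in release builds
    // when device_layers > model_layers, yielding usize::MAX
    // (18446744073709551615), the number seen in the log below.
    // model_layers - device_layers

    // Clamped form: excess requested device layers are ignored and the
    // host layer count stays at the expected 0.
    model_layers.saturating_sub(device_layers)
}

fn main() {
    // 32-layer model, 33 device layers requested: expect 0 host layers.
    assert_eq!(host_layers(32, 33), 0);
}
```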
```
RUST_BACKTRACE=1 ./mistralrs-bench --num-device-layers 33 gguf -m . -t . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
# ...
INFO mistralrs_core::device_map: Using 33 layers on device and 18446744073709551615 on host.
thread 'main' panicked at library/alloc/src/raw_vec.rs:25:5:
capacity overflow
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: alloc::raw_vec::capacity_overflow
   3: <T as alloc::vec::spec_from_elem::SpecFromElem>::from_elem
   4: mistralrs_core::device_map::DeviceMapMetadata::into_mapper
   5: mistralrs_core::models::quantized_llama::ModelWeights::from_gguf
   6: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   7: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   8: mistralrs_bench::main
```
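The capacity overflow is consistent with that wrapped value: the `SpecFromElem::from_elem` frame is what `vec![elem; n]` lowers to, and a `Vec` of `usize::MAX` elements trips the capacity check before anything is allocated. A standalone snippet (not mistral.rs code) reproduces the same panic message:

```rust
fn main() {
    // The wrapped host-layer count from the log above (usize::MAX on 64-bit).
    let host_layers: usize = 18446744073709551615;
    // `vec![elem; n]` lowers to SpecFromElem::from_elem (frame 3 in the
    // backtrace); a capacity this large panics with "capacity overflow".
    let _slots: Vec<u8> = vec![0u8; host_layers];
}
```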
Context:

- Q4_K_M GGUF model used, but this applies to any model whose layer count is exceeded.
- Since mistral.rs seems to enforce loading through HuggingFace API calls first, I've worked around that by allowing the 401 (Unauthorized) panic as described here; unlike llama-cpp, additional config files are enforced... I sourced those from here, but it lacked a tokenizer.json file, so I gave it one from another model (this has no relevance to the error encountered).
Additional feedback

mistral.rs doesn't present information like layer counts as clearly to me as llama-cpp does, and I don't know if there's some sort of inspect command to output/query the metadata?

I had thought it was 33 layers, but looking over the llama-cpp output again I see it's 32 with an extra layer appended afterwards:
```
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.38 MiB
llm_load_tensors: CUDA0 buffer size = 4095.16 MiB
```
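The accounting llama-cpp does for this output can be sketched roughly as follows (the function and field names are my own, not llama-cpp's): the request is capped at the 32 repeating layers plus one non-repeating output layer, so asking for 33 or more still leaves 0 layers on the host.

```rust
// Sketch of llama-cpp-style offload accounting; names are assumptions,
// not llama-cpp's actual identifiers.
struct Offload {
    repeating_on_gpu: usize,
    non_repeating_on_gpu: usize,
    host_layers: usize,
}

fn plan_offload(requested: usize, repeating: usize) -> Offload {
    let total = repeating + 1; // +1 for the non-repeating output layer
    let on_gpu = requested.min(total); // clamp: over-asking is harmless
    Offload {
        repeating_on_gpu: on_gpu.min(repeating),
        non_repeating_on_gpu: on_gpu.saturating_sub(repeating),
        host_layers: total - on_gpu,
    }
}

fn main() {
    // 33 requested against 32 repeating layers: 33/33 offloaded, 0 on host.
    let plan = plan_offload(33, 32);
    assert_eq!(
        (plan.repeating_on_gpu, plan.non_repeating_on_gpu, plan.host_layers),
        (32, 1, 0)
    );
}
```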
I find this sort of information quite helpful, so if mistral.rs could communicate that better too, that'd be nice 👍

Better communicating the device/GPU like above would also be nice vs what it currently displays:

```
mistralrs_bench: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
```
Latest commit
v0.1.8: https://github.com/EricLBuehler/mistral.rs/commit/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45