
bug: If device layers requested exceed model layers, host layers overflow

Open polarathene opened this issue 9 months ago • 5 comments

Describe the bug

If the number of device layers requested exceeds the model's layer count, the number of host layers to assign seems to wrap/overflow instead of being the expected 0 (a sketch of the clamped behaviour I'd expect follows the backtrace below).

NOTE: With llama-cpp you can request a larger number of layers and the host layers remain 0; only as many layers as the model actually has are offloaded to the device.

RUST_BACKTRACE=1 ./mistralrs-bench --num-device-layers 33 gguf -m . -t . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
# ...
INFO mistralrs_core::device_map: Using 33 layers on device and 18446744073709551615 on host.

thread 'main' panicked at library/alloc/src/raw_vec.rs:25:5:
capacity overflow
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: alloc::raw_vec::capacity_overflow
   3: <T as alloc::vec::spec_from_elem::SpecFromElem>::from_elem
   4: mistralrs_core::device_map::DeviceMapMetadata::into_mapper
   5: mistralrs_core::models::quantized_llama::ModelWeights::from_gguf
   6: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   7: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   8: mistralrs_bench::main
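
For illustration, here is a minimal sketch of the clamping behaviour I'd expect, assuming the mapper computes host layers as `model_layers - device_layers` with unsigned arithmetic (the function and names below are hypothetical, not mistral.rs's actual API):

```rust
/// Hypothetical helper: split a model's layers between device and host,
/// clamping the requested device layers so the subtraction cannot underflow.
fn split_layers(model_layers: usize, requested_device_layers: usize) -> (usize, usize) {
    // Never claim more device layers than the model actually has.
    let device_layers = requested_device_layers.min(model_layers);
    // saturating_sub guards against underflow even if the clamp above is removed.
    let host_layers = model_layers.saturating_sub(device_layers);
    (device_layers, host_layers)
}

fn main() {
    // 32-layer model, 33 device layers requested (as in this report):
    let (device, host) = split_layers(32, 33);
    // Prints "Using 32 layers on device and 0 on host." instead of wrapping to usize::MAX.
    println!("Using {device} layers on device and {host} on host.");
}
```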

Context:

  • A Q4_K_M GGUF model was used, but this applies to any model where the requested device layer count exceeds the model's actual layer count.
  • Since mistral.rs seems to enforce loading first through HuggingFace API calls, I've worked around that by allowing the 401 (Unauthorized) panic as described here. Unlike llama-cpp, additional config files are also required; I sourced those from here, but the repo lacked a tokenizer.json file so I used one from another model (this has no relevance to the error encountered).

Additional feedback

mistral.rs doesn't present information like layer counts as clearly to me as llama-cpp does, and I don't know if there's some sort of inspect command to output/query the model metadata?
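
To illustrate the kind of "inspect" output I have in mind, here is a rough sketch that dumps the layer count from a GGUF file using candle's GGUF reader (which mistral.rs builds on); the exact API surface and metadata key names are assumptions on my part:

```rust
use candle_core::quantized::gguf_file;
use std::fs::File;

fn main() {
    let path = std::env::args().nth(1).expect("usage: inspect <model.gguf>");
    let mut file = File::open(&path).expect("failed to open GGUF file");
    let content = gguf_file::Content::read(&mut file).expect("failed to parse GGUF header");

    // GGUF stores the repeating layer count under "<arch>.block_count",
    // e.g. "llama.block_count" = 32 for this 7B model.
    for (key, value) in &content.metadata {
        if key.ends_with(".block_count") {
            println!("{key} = {value:?}");
        }
    }
    println!("{} tensors", content.tensor_infos.len());
}
```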

I had thought it was 33 layers, but looking over the llama-cpp output again I see it's 32 repeating layers with an extra non-repeating layer offloaded afterwards:

ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.38 MiB
llm_load_tensors:      CUDA0 buffer size =  4095.16 MiB

I find this sort of information quite helpful, so if mistral.rs could communicate that better too that'd be nice 👍

Communicating the device/GPU details like llama-cpp does above would also be nicer than what mistral.rs currently displays:

mistralrs_bench: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...

Latest commit

v0.1.8: https://github.com/EricLBuehler/mistral.rs/commit/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45

polarathene · May 19 '24 01:05