
bug: If device layers requested exceed model layers, host layers overflow

Open polarathene opened this issue 9 months ago • 5 comments

Describe the bug

If the number of device layers requested exceeds the model's layer count, the number of host layers to assign seems to wrap/overflow instead of being the expected 0 (a sketch of the clamped behaviour I'd expect follows the backtrace below).

NOTE: With llama-cpp you can request a larger number of layers and the host layers remain 0; only as many layers as the model actually has are offloaded to the device.

RUST_BACKTRACE=1 ./mistralrs-bench --num-device-layers 33 gguf -m . -t . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
# ...
INFO mistralrs_core::device_map: Using 33 layers on device and 18446744073709551615 on host.

thread 'main' panicked at library/alloc/src/raw_vec.rs:25:5:
capacity overflow
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: alloc::raw_vec::capacity_overflow
   3: <T as alloc::vec::spec_from_elem::SpecFromElem>::from_elem
   4: mistralrs_core::device_map::DeviceMapMetadata::into_mapper
   5: mistralrs_core::models::quantized_llama::ModelWeights::from_gguf
   6: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   7: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   8: mistralrs_bench::main
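
For illustration, here is a minimal sketch of the clamping behaviour I'd expect, assuming the mapper computes host layers as `model_layers - device_layers` with unsigned arithmetic (the function and names below are hypothetical, not mistral.rs's actual API):

```rust
/// Hypothetical helper: split a model's layers between device and host,
/// clamping the requested device layers so the subtraction cannot underflow.
fn split_layers(model_layers: usize, requested_device_layers: usize) -> (usize, usize) {
    // Never claim more device layers than the model actually has.
    let device_layers = requested_device_layers.min(model_layers);
    // saturating_sub guards against underflow even if the clamp above is removed.
    let host_layers = model_layers.saturating_sub(device_layers);
    (device_layers, host_layers)
}

fn main() {
    // 32-layer model, 33 device layers requested (as in this report):
    let (device, host) = split_layers(32, 33);
    // Prints "Using 32 layers on device and 0 on host." instead of wrapping to usize::MAX.
    println!("Using {device} layers on device and {host} on host.");
}
```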

Context:

  • A Q4_K_M GGUF model was used, but this applies to any model where the requested device layer count exceeds the model's actual layer count.
  • Since mistral.rs seems to enforce loading first through HuggingFace API calls, I've worked around that by allowing the 401 (Unauthorized) panic as described here. Unlike llama-cpp, additional config files are also required; I sourced those from here, but the repo lacked a tokenizer.json file so I used one from another model (this has no relevance to the error encountered).

Additional feedback

mistral.rs doesn't present information like layer counts as clearly to me as llama-cpp does, and I don't know if there's some sort of inspect command to output/query the model metadata?
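
To illustrate the kind of "inspect" output I have in mind, here is a rough sketch that dumps the layer count from a GGUF file using candle's GGUF reader (which mistral.rs builds on); the exact API surface and metadata key names are assumptions on my part:

```rust
use candle_core::quantized::gguf_file;
use std::fs::File;

fn main() {
    let path = std::env::args().nth(1).expect("usage: inspect <model.gguf>");
    let mut file = File::open(&path).expect("failed to open GGUF file");
    let content = gguf_file::Content::read(&mut file).expect("failed to parse GGUF header");

    // GGUF stores the repeating layer count under "<arch>.block_count",
    // e.g. "llama.block_count" = 32 for this 7B model.
    for (key, value) in &content.metadata {
        if key.ends_with(".block_count") {
            println!("{key} = {value:?}");
        }
    }
    println!("{} tensors", content.tensor_infos.len());
}
```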

I had thought it was 33 layers, but looking over the llama-cpp output again I see it's 32 repeating layers with an extra non-repeating layer offloaded afterwards:

ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.38 MiB
llm_load_tensors:      CUDA0 buffer size =  4095.16 MiB

I find this sort of information quite helpful, so if mistral.rs could communicate that better too that'd be nice 👍

Communicating the device/GPU details like llama-cpp does above would also be nicer than what mistral.rs currently displays:

mistralrs_bench: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...

Latest commit

v0.1.8: https://github.com/EricLBuehler/mistral.rs/commit/ca9bf7d1a8a67bd69a3eed89841a106d2e518c45

polarathene · May 19 '24 01:05