unknown dtype for tensor (BF16?)
Describe the bug
My Q8_0 quant of Athene-70B loads fine. I have another quant that is identical except that the output and embedding tensors are BF16:
$ RUST_BACKTRACE=full ./mistralrs_server --interactive-mode --num-device-layers 13 --pa-ctxt-len 8192 gguf -m PATH -f Athene-70B-Q8_0-BF16.gguf
2024-08-01T17:08:17.446889Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-08-01T17:08:17.446907Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-01T17:08:17.446917Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-08-01T17:08:17.446981Z INFO mistralrs_core::pipeline::paths: Loading `Athene-70B-Q8_0-BF16.gguf` locally at `/PATH/Athene-70B-Q8_0-BF16.gguf`
2024-08-01T17:08:17.447033Z WARN mistralrs_core::pipeline::gguf: Device mapping and PagedAttention are incompatible, disabling PagedAttention.
Error: path: "/PATH/Athene-70B-Q8_0-BF16.gguf" unknown dtype for tensor 30
0: candle_core::error::Error::bt
1: candle_core::quantized::GgmlDType::from_u32
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
3: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
4: mistralrs_server::main::{{closure}}
5: mistralrs_server::main
6: std::sys_common::backtrace::__rust_begin_short_backtrace
7: std::rt::lang_start::{{closure}}
8: std::rt::lang_start_internal
9: main
10: __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
11: __libc_start_main_impl
at ./csu/../csu/libc-start.c:392:3
12: _start
Stack backtrace:
0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
1: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: mistralrs_server::main::{{closure}}
4: mistralrs_server::main
5: std::sys_common::backtrace::__rust_begin_short_backtrace
6: std::rt::lang_start::{{closure}}
7: std::rt::lang_start_internal
8: main
9: __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
10: __libc_start_main_impl
at ./csu/../csu/libc-start.c:392:3
11: _start
Latest commit or version
0.2.4
@oldgithubman yes, this is the problem. Please see huggingface/candle#2387, which will add BF16 support and more descriptive errors!
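For anyone following along, here is a rough, self-contained sketch of the kind of change involved (illustrative names only, not the actual candle patch): the GGUF loader maps the tensor-type id stored in the file to a dtype enum and errors on ids it does not recognize, and BF16 is GGML type id 30, which is exactly the "unknown dtype for tensor 30" in the backtrace above.

```rust
// Illustrative sketch, not the real candle API: map GGML/GGUF tensor-type ids
// to a dtype enum, including the BF16 variant (id 30) that the PR adds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GgmlDType {
    F32,  // id 0
    F16,  // id 1
    Q8_0, // id 8
    BF16, // id 30 -- the variant that was missing
}

fn dtype_from_u32(id: u32) -> Result<GgmlDType, String> {
    match id {
        0 => Ok(GgmlDType::F32),
        1 => Ok(GgmlDType::F16),
        8 => Ok(GgmlDType::Q8_0),
        30 => Ok(GgmlDType::BF16),
        other => Err(format!("unknown dtype for tensor {other}")),
    }
}

fn main() {
    // A pure Q8_0 file never hits id 30; the mixed file does for its
    // output/embedding tensors, which is where the original error came from.
    assert_eq!(dtype_from_u32(30), Ok(GgmlDType::BF16));
}
```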
@oldgithubman given that the Candle PR hasn't been merged, I have mirrored my changes onto our Candle fork so we can proceed. Please see #691, which should enable this to work.
To test:
git pull
git switch gguf_bf16
<test command here>
$ RUST_BACKTRACE=full ./mistralrs_server -i -n 13 --pa-ctxt-len 8192 gguf -m PATH -f Athene-70B-Q8_0-BF16.gguf
2024-08-20T04:09:57.420536Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-08-20T04:09:57.420558Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-20T04:09:57.420561Z INFO mistralrs_server: Using flash attention.
2024-08-20T04:09:57.420569Z WARN mistralrs_server: Using flash attention with a quantized model has no effect!
2024-08-20T04:09:57.420572Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-08-20T04:09:57.430607Z INFO mistralrs_core::pipeline::paths: Loading `Athene-70B-Q8_0-BF16.gguf` locally at `/media/j/72B264BFB2648A05/Athene-70B-Q8_0-BF16.gguf`
2024-08-20T04:09:57.430858Z WARN mistralrs_core::pipeline::gguf: Device mapping and PagedAttention are incompatible, disabling PagedAttention.
2024-08-20T04:09:57.655808Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.basename: Athene
general.file_type: 7
general.languages: en
general.license: cc-by-nc-4.0
general.name: Athene 70B
general.organization: Nexusflow
general.quantization_version: 2
general.size_label: 70B
general.tags: RLHF, Nexusflow, Athene, Chat Model
general.type: model
llama.attention.head_count: 64
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 80
llama.context_length: 8192
llama.embedding_length: 8192
llama.feed_forward_length: 28672
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128256
quantize.imatrix.entries_count: 560
quantize.imatrix.file: FILE
2024-08-20T04:09:57.883996Z INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2024-08-20T04:09:57.893896Z INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}`
Error: quantized type BF16 is not supported yet
0: candle_core::error::Error::bt
1: candle_core::quantized::ggml_file::qtensor_from_ggml
2: candle_core::quantized::gguf_file::Content::tensor
3: <mistralrs_core::models::quantized_llama::ModelWeights as mistralrs_core::utils::model_config::FromGGUF>::from_gguf
4: mistralrs_core::utils::model_config::<impl core::convert::TryFrom<mistralrs_core::utils::model_config::ModelParams<mistralrs_core::utils::model_config::ParamsGGUF>> for mistralrs_core::models::quantized_llama::ModelWeights>::try_from
5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
6: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
7: mistralrs_server::main::{{closure}}
8: mistralrs_server::main
9: std::sys_common::backtrace::__rust_begin_short_backtrace
10: std::rt::lang_start::{{closure}}
11: std::rt::lang_start_internal
12: main
13: __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
14: __libc_start_main_impl
at ./csu/../csu/libc-start.c:392:3
15: _start
Stack backtrace:
0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
1: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: mistralrs_server::main::{{closure}}
4: mistralrs_server::main
5: std::sys_common::backtrace::__rust_begin_short_backtrace
6: std::rt::lang_start::{{closure}}
7: std::rt::lang_start_internal
8: main
9: __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
10: __libc_start_main_impl
at ./csu/../csu/libc-start.c:392:3
11: _start
@oldgithubman thanks, that should be fixed now if you git pull again and retry!
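For context, a BF16 value is just the upper 16 bits of an IEEE-754 f32, so loading a raw BF16 tensor from the file essentially boils down to a 16-bit widening per element. A standalone sketch of that conversion (hypothetical helper name, not the mistral.rs/candle code):

```rust
// Hypothetical helper: widen little-endian BF16 bytes to f32 by shifting the
// 16-bit pattern into the high half of a 32-bit float.
fn bf16_bytes_to_f32(raw: &[u8]) -> Vec<f32> {
    raw.chunks_exact(2)
        .map(|pair| {
            let bits = u16::from_le_bytes([pair[0], pair[1]]);
            f32::from_bits(u32::from(bits) << 16)
        })
        .collect()
}

fn main() {
    // 0x3F80 is BF16 for 1.0 and 0xC000 is BF16 for -2.0 (stored little-endian).
    let raw = [0x80u8, 0x3F, 0x00, 0xC0];
    assert_eq!(bf16_bytes_to_f32(&raw), vec![1.0, -2.0]);
}
```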
ERROR mistralrs_core::engine: prompt step - Model failed with error: WithBacktrace { inner: Msg("unsupported dtype for quantized matmul BF16"), backtrace: Backtrace [{ fn: "candle_core::error::Error::bt" }, { fn: "<candle_core::quantized::QMatMul as candle_core::Module>::forward" }, { fn: "<mistralrs_quant::gguf::GgufMatMul as mistralrs_quant::QuantMethod>::forward" }, { fn: "mistralrs_core::models::quantized_llama::ModelWeights::forward" }, { fn: "<mistralrs_core::pipeline::gguf::GGUFPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs" }, { fn: "mistralrs_core::pipeline::Pipeline::step::{{closure}}" }, { fn: "mistralrs_core::engine::Engine::run::{{closure}}" }, { fn: "std::sys_common::backtrace::__rust_begin_short_backtrace" }, { fn: "core::ops::function::FnOnce::call_once{{vtable.shim}}" }, { fn: "std::sys::pal::unix::thread::Thread::new::thread_start" }, { fn: "start_thread", file: "./nptl/pthread_create.c", line: 442 }, { fn: "__GI___clone3", file: "./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S", line: 81 }] }
2024-08-20T18:09:40.158461Z ERROR mistralrs_server::interactive_mode: Got a model error: "unsupported dtype for quantized matmul BF16\n 0: candle_core::error::Error::bt\n 1: <candle_core::quantized::QMatMul as candle_core::Module>::forward\n 2: <mistralrs_quant::gguf::GgufMatMul as mistralrs_quant::QuantMethod>::forward\n 3: mistralrs_core::models::quantized_llama::ModelWeights::forward\n 4: <mistralrs_core::pipeline::gguf::GGUFPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs\n 5: mistralrs_core::pipeline::Pipeline::step::{{closure}}\n 6: mistralrs_core::engine::Engine::run::{{closure}}\n 7: std::sys_common::backtrace::__rust_begin_short_backtrace\n 8: core::ops::function::FnOnce::call_once{{vtable.shim}}\n 9: std::sys::pal::unix::thread::Thread::new::thread_start\n 10: start_thread\n at ./nptl/pthread_create.c:442:8\n 11: __GI___clone3\n at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81\n", response: ChatCompletionResponse { id: "0", choices: [Choice { finish_reason: "error", index: 0, message: ResponseMessage { content: Some(""), role: "assistant", tool_calls: [] }, logprobs: None }], created: 1724177337, model: "PATH", system_fingerprint: "local", object: "chat.completion", usage: Usage { completion_tokens: 0, prompt_tokens: 48, total_tokens: 48, avg_tok_per_sec: 1.1136891, avg_prompt_tok_per_sec: inf, avg_compl_tok_per_sec: NaN, total_time_sec: 43.1, total_prompt_time_sec: 0.0, total_completion_time_sec: 0.0 } }
@oldgithubman can you please run with RUST_BACKTRACE=1?
That was run with RUST_BACKTRACE=full. Do you still want me to rerun it with RUST_BACKTRACE=1?
Ah ok thanks, I'll take a look.
@oldgithubman I just updated the branch to correctly setup the QMatMul (#691).
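The gist of that failure was that the BF16 output/embedding tensors were still being routed through the quantized matmul kernel, which is what produced "unsupported dtype for quantized matmul BF16"; unquantized tensors need a dense matmul path instead. A minimal sketch of that routing decision (hypothetical names, not the actual #691 diff):

```rust
// Illustrative only: decide per tensor whether to use the quantized kernels or
// fall back to a plain dense matmul for unquantized dtypes such as BF16.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TensorKind {
    Quantized, // Q8_0, Q4_K, ... -> quantized matmul kernels
    Dense,     // F32 / F16 / BF16 -> plain dense matmul
}

#[derive(Debug, PartialEq, Eq)]
enum MatMulPath {
    QuantizedKernel,
    DenseMatMul,
}

fn pick_matmul_path(kind: TensorKind) -> MatMulPath {
    match kind {
        TensorKind::Quantized => MatMulPath::QuantizedKernel,
        TensorKind::Dense => MatMulPath::DenseMatMul,
    }
}

fn main() {
    // The BF16 output/embedding tensors must end up on the dense path.
    assert_eq!(pick_matmul_path(TensorKind::Dense), MatMulPath::DenseMatMul);
}
```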
Works!
@oldgithubman thanks for confirming! I just merged #691, so this feature is available on master and will be in 0.2.6 in a few days.
@oldgithubman closing this issue as it works, please feel free to reopen!