unknown dtype for tensor (BF16?)
Describe the bug
My Q8_0 quant of Athene-70B loads fine. I have another quant that is identical except that the output and embedding tensors are BF16:
$ RUST_BACKTRACE=full ./mistralrs_server --interactive-mode --num-device-layers 13 --pa-ctxt-len 8192 gguf -m PATH -f Athene-70B-Q8_0-BF16.gguf
2024-08-01T17:08:17.446889Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-08-01T17:08:17.446907Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-01T17:08:17.446917Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-08-01T17:08:17.446981Z INFO mistralrs_core::pipeline::paths: Loading `Athene-70B-Q8_0-BF16.gguf` locally at `/PATH/Athene-70B-Q8_0-BF16.gguf`
2024-08-01T17:08:17.447033Z WARN mistralrs_core::pipeline::gguf: Device mapping and PagedAttention are incompatible, disabling PagedAttention.
Error: path: "/PATH/Athene-70B-Q8_0-BF16.gguf" unknown dtype for tensor 30
0: candle_core::error::Error::bt
1: candle_core::quantized::GgmlDType::from_u32
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
3: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
4: mistralrs_server::main::{{closure}}
5: mistralrs_server::main
6: std::sys_common::backtrace::__rust_begin_short_backtrace
7: std::rt::lang_start::{{closure}}
8: std::rt::lang_start_internal
9: main
10: __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
11: __libc_start_main_impl
at ./csu/../csu/libc-start.c:392:3
12: _start
Stack backtrace:
0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
1: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: mistralrs_server::main::{{closure}}
4: mistralrs_server::main
5: std::sys_common::backtrace::__rust_begin_short_backtrace
6: std::rt::lang_start::{{closure}}
7: std::rt::lang_start_internal
8: main
9: __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
10: __libc_start_main_impl
at ./csu/../csu/libc-start.c:392:3
11: _start
Latest commit or version
0.2.4
@oldgithubman yes, this is the problem. Please see huggingface/candle#2387, which will add BF16 support and more descriptive errors!
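For anyone following along, here is a rough, self-contained sketch of the kind of change involved (illustrative names only, not the actual candle patch): the GGUF loader maps the tensor-type id stored in the file to a dtype enum and errors on ids it does not recognize, and BF16 is GGML type id 30, which is exactly the "unknown dtype for tensor 30" in the backtrace above.

```rust
// Illustrative sketch, not the real candle API: map GGML/GGUF tensor-type ids
// to a dtype enum, including the BF16 variant (id 30) that the PR adds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GgmlDType {
    F32,  // id 0
    F16,  // id 1
    Q8_0, // id 8
    BF16, // id 30 -- the variant that was missing
}

fn dtype_from_u32(id: u32) -> Result<GgmlDType, String> {
    match id {
        0 => Ok(GgmlDType::F32),
        1 => Ok(GgmlDType::F16),
        8 => Ok(GgmlDType::Q8_0),
        30 => Ok(GgmlDType::BF16),
        other => Err(format!("unknown dtype for tensor {other}")),
    }
}

fn main() {
    // A pure Q8_0 file never hits id 30; the mixed file does for its
    // output/embedding tensors, which is where the original error came from.
    assert_eq!(dtype_from_u32(30), Ok(GgmlDType::BF16));
}
```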
@oldgithubman given that the Candle PR hasn't been merged, I have mirrored my changes onto our Candle fork so we can proceed. Please see #691, which should enable this to work.
To test:
git pull
git switch gguf_bf16
<test command here>
$ RUST_BACKTRACE=full ./mistralrs_server -i -n 13 --pa-ctxt-len 8192 gguf -m PATH -f Athene-70B-Q8_0-BF16.gguf
2024-08-20T04:09:57.420536Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-08-20T04:09:57.420558Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-20T04:09:57.420561Z INFO mistralrs_server: Using flash attention.
2024-08-20T04:09:57.420569Z WARN mistralrs_server: Using flash attention with a quantized model has no effect!
2024-08-20T04:09:57.420572Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-08-20T04:09:57.430607Z INFO mistralrs_core::pipeline::paths: Loading `Athene-70B-Q8_0-BF16.gguf` locally at `/media/j/72B264BFB2648A05/Athene-70B-Q8_0-BF16.gguf`
2024-08-20T04:09:57.430858Z WARN mistralrs_core::pipeline::gguf: Device mapping and PagedAttention are incompatible, disabling PagedAttention.
2024-08-20T04:09:57.655808Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.basename: Athene
general.file_type: 7
general.languages: en
general.license: cc-by-nc-4.0
general.name: Athene 70B
general.organization: Nexusflow
general.quantization_version: 2
general.size_label: 70B
general.tags: RLHF, Nexusflow, Athene, Chat Model
general.type: model
llama.attention.head_count: 64
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 80
llama.context_length: 8192
llama.embedding_length: 8192
llama.feed_forward_length: 28672
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128256
quantize.imatrix.entries_count: 560
quantize.imatrix.file: FILE
2024-08-20T04:09:57.883996Z INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2024-08-20T04:09:57.893896Z INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}`
Error: quantized type BF16 is not supported yet
0: candle_core::error::Error::bt
1: candle_core::quantized::ggml_file::qtensor_from_ggml
2: candle_core::quantized::gguf_file::Content::tensor
3: <mistralrs_core::models::quantized_llama::ModelWeights as mistralrs_core::utils::model_config::FromGGUF>::from_gguf
4: mistralrs_core::utils::model_config::<impl core::convert::TryFrom<mistralrs_core::utils::model_config::ModelParams<mistralrs_core::utils::model_config::ParamsGGUF>> for mistralrs_core::models::quantized_llama::ModelWeights>::try_from
5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
6: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
7: mistralrs_server::main::{{closure}}
8: mistralrs_server::main
9: std::sys_common::backtrace::__rust_begin_short_backtrace
10: std::rt::lang_start::{{closure}}
11: std::rt::lang_start_internal
12: main
13: __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
14: __libc_start_main_impl
at ./csu/../csu/libc-start.c:392:3
15: _start
Stack backtrace:
0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
1: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: mistralrs_server::main::{{closure}}
4: mistralrs_server::main
5: std::sys_common::backtrace::__rust_begin_short_backtrace
6: std::rt::lang_start::{{closure}}
7: std::rt::lang_start_internal
8: main
9: __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
10: __libc_start_main_impl
at ./csu/../csu/libc-start.c:392:3
11: _start
@oldgithubman thanks, that should be fixed now if you git pull again and retry!
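For context, a BF16 value is just the upper 16 bits of an IEEE-754 f32, so loading a raw BF16 tensor from the file essentially boils down to a 16-bit widening per element. A standalone sketch of that conversion (hypothetical helper name, not the mistral.rs/candle code):

```rust
// Hypothetical helper: widen little-endian BF16 bytes to f32 by shifting the
// 16-bit pattern into the high half of a 32-bit float.
fn bf16_bytes_to_f32(raw: &[u8]) -> Vec<f32> {
    raw.chunks_exact(2)
        .map(|pair| {
            let bits = u16::from_le_bytes([pair[0], pair[1]]);
            f32::from_bits(u32::from(bits) << 16)
        })
        .collect()
}

fn main() {
    // 0x3F80 is BF16 for 1.0 and 0xC000 is BF16 for -2.0 (stored little-endian).
    let raw = [0x80u8, 0x3F, 0x00, 0xC0];
    assert_eq!(bf16_bytes_to_f32(&raw), vec![1.0, -2.0]);
}
```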
ERROR mistralrs_core::engine: prompt step - Model failed with error: WithBacktrace { inner: Msg("unsupported dtype for quantized matmul BF16"), backtrace: Backtrace [{ fn: "candle_core::error::Error::bt" }, { fn: "<candle_core::quantized::QMatMul as candle_core::Module>::forward" }, { fn: "<mistralrs_quant::gguf::GgufMatMul as mistralrs_quant::QuantMethod>::forward" }, { fn: "mistralrs_core::models::quantized_llama::ModelWeights::forward" }, { fn: "<mistralrs_core::pipeline::gguf::GGUFPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs" }, { fn: "mistralrs_core::pipeline::Pipeline::step::{{closure}}" }, { fn: "mistralrs_core::engine::Engine::run::{{closure}}" }, { fn: "std::sys_common::backtrace::__rust_begin_short_backtrace" }, { fn: "core::ops::function::FnOnce::call_once{{vtable.shim}}" }, { fn: "std::sys::pal::unix::thread::Thread::new::thread_start" }, { fn: "start_thread", file: "./nptl/pthread_create.c", line: 442 }, { fn: "__GI___clone3", file: "./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S", line: 81 }] }
2024-08-20T18:09:40.158461Z ERROR mistralrs_server::interactive_mode: Got a model error: "unsupported dtype for quantized matmul BF16\n 0: candle_core::error::Error::bt\n 1: <candle_core::quantized::QMatMul as candle_core::Module>::forward\n 2: <mistralrs_quant::gguf::GgufMatMul as mistralrs_quant::QuantMethod>::forward\n 3: mistralrs_core::models::quantized_llama::ModelWeights::forward\n 4: <mistralrs_core::pipeline::gguf::GGUFPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs\n 5: mistralrs_core::pipeline::Pipeline::step::{{closure}}\n 6: mistralrs_core::engine::Engine::run::{{closure}}\n 7: std::sys_common::backtrace::__rust_begin_short_backtrace\n 8: core::ops::function::FnOnce::call_once{{vtable.shim}}\n 9: std::sys::pal::unix::thread::Thread::new::thread_start\n 10: start_thread\n at ./nptl/pthread_create.c:442:8\n 11: __GI___clone3\n at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81\n", response: ChatCompletionResponse { id: "0", choices: [Choice { finish_reason: "error", index: 0, message: ResponseMessage { content: Some(""), role: "assistant", tool_calls: [] }, logprobs: None }], created: 1724177337, model: "PATH", system_fingerprint: "local", object: "chat.completion", usage: Usage { completion_tokens: 0, prompt_tokens: 48, total_tokens: 48, avg_tok_per_sec: 1.1136891, avg_prompt_tok_per_sec: inf, avg_compl_tok_per_sec: NaN, total_time_sec: 43.1, total_prompt_time_sec: 0.0, total_completion_time_sec: 0.0 } }
@oldgithubman can you please run with RUST_BACKTRACE=1?
That was run with RUST_BACKTRACE=full. Do you still want me to rerun it with RUST_BACKTRACE=1?
Ah ok thanks, I'll take a look.
@oldgithubman I just updated the branch to correctly setup the QMatMul (#691).
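The gist of that failure was that the BF16 output/embedding tensors were still being routed through the quantized matmul kernel, which is what produced "unsupported dtype for quantized matmul BF16"; unquantized tensors need a dense matmul path instead. A minimal sketch of that routing decision (hypothetical names, not the actual #691 diff):

```rust
// Illustrative only: decide per tensor whether to use the quantized kernels or
// fall back to a plain dense matmul for unquantized dtypes such as BF16.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TensorKind {
    Quantized, // Q8_0, Q4_K, ... -> quantized matmul kernels
    Dense,     // F32 / F16 / BF16 -> plain dense matmul
}

#[derive(Debug, PartialEq, Eq)]
enum MatMulPath {
    QuantizedKernel,
    DenseMatMul,
}

fn pick_matmul_path(kind: TensorKind) -> MatMulPath {
    match kind {
        TensorKind::Quantized => MatMulPath::QuantizedKernel,
        TensorKind::Dense => MatMulPath::DenseMatMul,
    }
}

fn main() {
    // The BF16 output/embedding tensors must end up on the dense path.
    assert_eq!(pick_matmul_path(TensorKind::Dense), MatMulPath::DenseMatMul);
}
```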
Works!
@oldgithubman thanks for confirming! I just merged #691, so this feature is available on master and will be in 0.2.6 in a few days.
@oldgithubman closing this issue as it works, please feel free to reopen!