CUDA+CPU mixed device mapping: panic at unwrap (SendError)
Describe the bug
Since I was running out of memory all the time (GPU memory full after a few prompts, I guess), I tried `-n 16`. The result is not pretty, regardless of what number I choose (anywhere between 0 and 27 for this model):
...
2025-01-29T20:21:26.485294Z INFO mistralrs_core::utils::log: Automatic loader type determined to be `qwen2`
2025-01-29T20:21:26.485324Z INFO mistralrs_core::utils::log: Model has 28 repeating layers.
2025-01-29T20:21:26.485332Z INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
2025-01-29T20:21:26.485336Z INFO mistralrs_core::utils::log: Layers 0-23: cuda[0]
2025-01-29T20:21:26.485342Z INFO mistralrs_core::utils::log: Layers 24-27: cpu
2025-01-29T20:21:26.501757Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.9
2025-01-29T20:21:26.665318Z INFO mistralrs_core::utils::normal: DType selected is F16.
2025-01-29T20:21:26.665338Z WARN mistralrs_core::pipeline::normal: Device mapping contains a mix of GPU and CPU. There is no CPU support for PagedAttention, disabling PagedAttention.
2025-01-29T20:21:26.665374Z INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 151936, hidden_size: 1536, intermediate_size: 8960, num_hidden_layers: 28, num_attention_heads: 12, num_key_value_heads: 2, max_position_embeddings: 131072, sliding_window: 4096, rope_theta: 10000.0, rms_norm_eps: 1e-6, hidden_act: Silu, use_flash_attn: false, quantization_config: None, tie_word_embeddings: false }
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 339/339 [00:03<00:00, 1125.32it/s]
2025-01-29T20:21:32.697416Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin▁of▁sentence|>", eos_toks = "<|end▁of▁sentence|>", unk_tok = `None`
2025-01-29T20:21:32.711477Z INFO mistralrs_server: Model loaded.
2025-01-29T20:21:32.711772Z INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2025-01-29T20:21:32.730393Z INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2025-01-29T20:21:32.733387Z INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2025-01-29T20:21:32.733461Z INFO mistralrs_core: Beginning dummy run.
2025-01-29T20:21:32.869266Z INFO mistralrs_core: Dummy run completed in 0.135790476s.
2025-01-29T20:21:32.869298Z INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", "\"", "*"], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }
====================
Welcome to interactive mode! Because this model is a text model, you can enter prompts and chat with the model.
Commands:
- `\help`: Display this message.
- `\exit`: Quit interactive mode.
- `\system <system message here>`:
Add a system message to the chat without running the model.
Ex: `\system Always respond as a pirate.`
====================
> thread '<unnamed>' panicked at mistralrs-core/src/pipeline/sampling.rs:275:26:
Expected receiver.: SendError { .. }
stack backtrace:
0: 0x56420e0dad4a - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h5b6bd5631a6d1f6b
1: 0x56420e108453 - core::fmt::write::h7550c97b06c86515
2: 0x56420e0d5d93 - std::io::Write::write_fmt::h7b09c64fe0be9c84
3: 0x56420e0dab92 - std::sys::backtrace::BacktraceLock::print::h2395ccd2c84ba3aa
4: 0x56420e0dbc7c - std::panicking::default_hook::{{closure}}::he19d4c7230e07961
5: 0x56420e0dbac2 - std::panicking::default_hook::hf614597d3c67bbdb
6: 0x56420e0dc257 - std::panicking::rust_panic_with_hook::h8942133a8b252070
7: 0x56420e0dc0ea - std::panicking::begin_panic_handler::{{closure}}::hb5f5963570096b29
8: 0x56420e0db229 - std::sys::backtrace::__rust_end_short_backtrace::h6208cedc1922feda
9: 0x56420e0dbd7c - rust_begin_unwind
10: 0x56420c4104e0 - core::panicking::panic_fmt::h0c3082644d1bf418
11: 0x56420c410926 - core::result::unwrap_failed::hd20b4aa073bda1e2
12: 0x56420c797eb4 - mistralrs_core::pipeline::sampling::sample_and_add_toks::{{closure}}::h1c00dd9e42fb98b5
13: 0x56420c7be02a - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::sample_causal_gen::{{closure}}::h9b0659e9952fceb4
14: 0x56420c7bec57 - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h020bab495a2e59f9
15: 0x56420c73d912 - mistralrs_core::engine::Engine::run::{{closure}}::hf4f7560d69ff8211
16: 0x56420c7324bb - tokio::runtime::park::CachedParkThread::block_on::hbed330568fc0ecd9
17: 0x56420c81451b - tokio::runtime::runtime::Runtime::block_on::hcf18a780627bb974
18: 0x56420c71303d - std::sys::backtrace::__rust_begin_short_backtrace::hae6e88b3cce5c27a
19: 0x56420cab1f5b - core::ops::function::FnOnce::call_once{{vtable.shim}}::h4415d8ec34ac95e2
20: 0x56420e0e10fb - std::sys::pal::unix::thread::Thread::new::thread_start::hcc78f3943333fa94
21: 0x7f6ce5bb6043 - start_thread
at ./nptl/pthread_create.c:447:8
22: 0x7f6ce5c34778 - __GI___clone3
at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
23: 0x0 - <unknown>
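If I read the trace right, the `unwrap_failed` inside `sample_and_add_toks` looks like an `.expect("Expected receiver.")` on a channel send whose receiving side has already gone away. A minimal sketch of that failure mode (hypothetical channel and value, not the actual mistral.rs code), assuming a tokio mpsc channel is involved:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Hypothetical response channel: the engine would send sampled tokens
    // back to whoever issued the request.
    let (tx, rx) = mpsc::channel::<u32>(8);

    // If the receiving side is dropped first (request torn down, client gone,
    // or an earlier error ended the handler)...
    drop(rx);

    // ...the next send fails with `SendError { .. }`, and an
    // `.expect("Expected receiver.")` on that result panics with the same
    // message as in the trace above.
    tx.send(42).await.expect("Expected receiver.");
}
```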
Latest commit or version
$ ./mistralrs-server.gpu --version
mistralrs-server 0.4.0
(pulled from git this morning)
I'm hitting this problem too.
@grinapo @Sherlock-Holo I merged some fixes, can you please try this again after git pull and rebuild?
I tried commit c9ac3213264be0dbbe010ad0035715a563b64bb8.
It no longer seems to panic; however, when I run interactive mode with `~/git/mistral.rs/target/release/mistralrs-server -i -n 999 gguf -f ../DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf -m .`
and ask
介绍一下你自己 ("introduce yourself"),
it answers with garbage:
<think>1BD5*:38"'L/.$#-MF,60C)T=?2;V[J&`@EOZQ_\djU9KlH7(pfcSbGAgWsez�]w^trRhPau�4x!���iX~}vn��I��k���������|�{ĤN%q�+yͿ���i��inja٪o��¾��ӯ�ƺ���Y���������ݧ borders�����������������
�
� m���� ���퍂�Ձ���ֈ���
���������re��� �it
�le�stenar;
��erat� = ion bingou tis the sed m�ic���entes in to dct ofinas hnd c a-- p th // ad n l fil** oro wurse)
etorimifiv {
ut,
el ( * {thodch Cex
ueem g ay }ew v Sotationusame T ' Iul� reort� Aateheunig st());
pe on " yamportistly----intassith on andturnoutubresersour ifageolaghttr== for;
ctionnt de D $pt this Blicir Rosurnkeandend asquect thatde r anile // F pro;
Lorelassain.
avter P("ri com return="idver->iz orataill>
al E withind::ess trimportrom -artstr whment G you it jup H.. fromest Wublicconereityight }
con new elformgetack [{
plant", at O:
leextld seang classiveact notase****og.certodealue</ffineial propare();
}
ies chontize ex--------able intavecl U.s heost byype +iew shard(' :ib Nire.piceime_tac }
Theure are(dy__astoid urou', res wasoc =>aceord).athokyst publicustidethisalltring usjectpp))name’ @ailult Jone iak pl********ullqueveptionansellignachSt}
pon we_p get & k_scc /ount */==== canInputystempromearamber's unER ==row �err_c);
. enll##erv.get Vip ab['rrName #iteressdivear Thatedob appitionie contouldvent elildreturniaunctionRe </INlesud.
has outple willStringuctoryON upformaryther"
per _ strso']
ongax godata(),
alse bo function.Codelakeivateft), */
selferyliquest yourddadd constally allens void meselfcriatchichance’sec System In_mawph($ K do iminkge----------------teenceugork have)
ethaticIddefssoustonST elsebo setset ifepRE);
but !contabeleldtp.t_fpro more");
ferbjectudeener",
ser////ents()
.Sreeointationskeyegown arieldneish Y####ef var',
old defThe_d |lect ne saioAT perval name socre dis end.m theyange true my nullings"> lo”ViewListExolorurre one_id.hound returnictessageCont":OReng >ickovclude hisark St ) private.A whoread">
.T_brrayveltemlogrypan theirpublicuser liwender man time data id****************rouperror any appled
.comarggh falsecomirst Chions_C]
.f pregrator overource";
I have to press Ctrl+C to stop it.
@Sherlock-Holo @grinapo this error should be somewhat fixed in https://github.com/EricLBuehler/mistral.rs/issues/1137#issuecomment-2657877915. I would recommend specifying a chat template explicitly, as documented here.
Regarding templates, you probably meant this instead.
commit 8d89c14b431a5c2d6115346a46acbda4840e7445 (HEAD -> master, origin/master, origin/HEAD)
Date:   Wed Feb 19 12:02:01 2025 -0500
$ RUST_BACKTRACE=1 ./mistralrs-server.gpu.pattn --chat-template ./chat_templates/default.json -i plain -m deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
2025-02-21T15:16:16.606335Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2025-02-21T15:16:16.606358Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-02-21T15:16:16.606359Z INFO mistralrs_server: Using flash attention.
2025-02-21T15:16:16.606373Z INFO mistralrs_server: Model kind is: normal (no adapters)
2025-02-21T15:16:16.606442Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
2025-02-21T15:16:16.606470Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
2025-02-21T15:16:16.797919Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-000002.safetensors", "model-00002-of-000002.safetensors"]
2025-02-21T15:16:16.944763Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
2025-02-21T15:16:17.235388Z INFO mistralrs_core::pipeline::normal: Using chat template file at `./chat_templates/default.json`
2025-02-21T15:16:17.373092Z INFO mistralrs_core::pipeline::normal: Prompt chunk size is 512.
2025-02-21T15:16:17.417386Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.9
2025-02-21T15:16:17.436278Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-02-21T15:16:17.436321Z INFO mistralrs_core::utils::log: Automatic loader type determined to be `qwen2`
2025-02-21T15:16:17.662515Z INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-02-21T15:16:17.693737Z INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-02-21T15:16:17.693768Z INFO mistralrs_core::utils::log: Model has 28 repeating layers.
2025-02-21T15:16:17.693772Z INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
2025-02-21T15:16:17.693779Z INFO mistralrs_core::utils::log: Layers 0-10: cuda[0]
2025-02-21T15:16:17.693781Z INFO mistralrs_core::utils::log: Layers 11-27: cpu
2025-02-21T15:16:17.701166Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.9
2025-02-21T15:16:17.740133Z INFO mistralrs_core::utils::normal: DType selected is F16.
2025-02-21T15:16:17.740151Z WARN mistralrs_core::pipeline::normal: Device mapping contains a mix of GPU and CPU. There is no CPU support for PagedAttention, disabling PagedAttention.
2025-02-21T15:16:17.740168Z INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 152064, hidden_size: 3584, intermediate_size: 18944, num_hidden_layers: 28, num_attention_heads: 28, num_key_value_heads: 4, max_position_embeddings: 131072, sliding_window: 4096, rope_theta: 10000.0, rms_norm_eps: 1e-6, hidden_act: Silu, use_flash_attn: true, quantization_config: None, tie_word_embeddings: false }
2025-02-21T15:16:25.848278Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin▁of▁sentence|>", eos_toks = "<|end▁of▁sentence|>", unk_tok = `None`
2025-02-21T15:16:25.852517Z INFO mistralrs_server: Model loaded.
2025-02-21T15:16:25.852790Z INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2025-02-21T15:16:25.859010Z INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2025-02-21T15:16:25.859617Z INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2025-02-21T15:16:25.859672Z INFO mistralrs_core: Beginning dummy run.
2025-02-21T15:16:26.113888Z INFO mistralrs_core: Dummy run completed in 0.254206707s.
2025-02-21T15:16:26.113919Z INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", "\"", "*"], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }
====================
Welcome to interactive mode! Because this model is a text model, you can enter prompts and chat with the model.
Commands:
- `\help`: Display this message.
- `\exit`: Quit interactive mode.
- `\system <system message here>`:
Add a system message to the chat without running the model.
Ex: `\system Always respond as a pirate.`
====================
> thread '<unnamed>' panicked at mistralrs-core/src/pipeline/sampling.rs:289:26:
Expected receiver.: SendError { .. }
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: mistralrs_core::pipeline::sampling::sample_and_add_toks::{{closure}}
4: <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::sample_causal_gen::{{closure}}
5: mistralrs_core::pipeline::Pipeline::step::{{closure}}
6: mistralrs_core::engine::Engine::run::{{closure}}
7: tokio::runtime::park::CachedParkThread::block_on
8: tokio::runtime::runtime::Runtime::block_on
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
boo!
thread 'main' panicked at mistralrs-server/src/interactive_mode.rs:181:32:
called `Result::unwrap()` on an `Err` value: SendError { .. }
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: mistralrs_server::interactive_mode::text_interactive_mode::{{closure}}
4: mistralrs_server::main::{{closure}}
5: tokio::runtime::park::CachedParkThread::block_on
6: mistralrs_server::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
$
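The second panic (at interactive_mode.rs:181) looks like the downstream effect of the first: once the engine thread has died, the interactive loop's own `send(...).unwrap()` has no receiver left either. A rough sketch of that cascade, with made-up channel names and a stand-in panic, just to illustrate how the two messages relate:

```rust
use std::thread;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Hypothetical request channel from the interactive loop to the engine.
    let (req_tx, req_rx) = mpsc::channel::<String>(8);

    // Stand-in for the engine thread: it panics (as sampling.rs did above),
    // which drops `req_rx` along with it.
    let engine = thread::spawn(move || {
        let _rx = req_rx;
        panic!("engine gave up");
    });
    let _ = engine.join(); // engine gone, receiver dropped

    // The interactive loop's next send now returns Err(SendError { .. }),
    // so a plain `.unwrap()` on it produces the second panic shown above.
    req_tx.send("next prompt".to_string()).await.unwrap();
}
```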
I don't claim to be familiar with the system, so apologies if I'm stating the obvious, but the exact same command line works with the 1.5B model (which fits into the GPU) and fails with the 7B model (which doesn't, see above). Their chat templates (in fact their whole tokenizer_config.json files) are identical.