CUDA+CPU mixed device mapping: panic at unwrap (SendError)
Describe the bug
Since I was running out of memory all the time (GPU memory full after a few prompts, I guess), I tried `-n 16`. The result is not pretty, regardless of what number I choose (anywhere between 0 and 27 for this model):
...
2025-01-29T20:21:26.485294Z INFO mistralrs_core::utils::log: Automatic loader type determined to be `qwen2`
2025-01-29T20:21:26.485324Z INFO mistralrs_core::utils::log: Model has 28 repeating layers.
2025-01-29T20:21:26.485332Z INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
2025-01-29T20:21:26.485336Z INFO mistralrs_core::utils::log: Layers 0-23: cuda[0]
2025-01-29T20:21:26.485342Z INFO mistralrs_core::utils::log: Layers 24-27: cpu
2025-01-29T20:21:26.501757Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.9
2025-01-29T20:21:26.665318Z INFO mistralrs_core::utils::normal: DType selected is F16.
2025-01-29T20:21:26.665338Z WARN mistralrs_core::pipeline::normal: Device mapping contains a mix of GPU and CPU. There is no CPU support for PagedAttention, disabling PagedAttention.
2025-01-29T20:21:26.665374Z INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 151936, hidden_size: 1536, intermediate_size: 8960, num_hidden_layers: 28, num_attention_heads: 12, num_key_value_heads: 2, max_position_embeddings: 131072, sliding_window: 4096, rope_theta: 10000.0, rms_norm_eps: 1e-6, hidden_act: Silu, use_flash_attn: false, quantization_config: None, tie_word_embeddings: false }
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 339/339 [00:03<00:00, 1125.32it/s]
2025-01-29T20:21:32.697416Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin▁of▁sentence|>", eos_toks = "<|end▁of▁sentence|>", unk_tok = `None`
2025-01-29T20:21:32.711477Z INFO mistralrs_server: Model loaded.
2025-01-29T20:21:32.711772Z INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2025-01-29T20:21:32.730393Z INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2025-01-29T20:21:32.733387Z INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2025-01-29T20:21:32.733461Z INFO mistralrs_core: Beginning dummy run.
2025-01-29T20:21:32.869266Z INFO mistralrs_core: Dummy run completed in 0.135790476s.
2025-01-29T20:21:32.869298Z INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", "\"", "*"], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }
====================
Welcome to interactive mode! Because this model is a text model, you can enter prompts and chat with the model.
Commands:
- `\help`: Display this message.
- `\exit`: Quit interactive mode.
- `\system <system message here>`:
Add a system message to the chat without running the model.
Ex: `\system Always respond as a pirate.`
====================
> thread '<unnamed>' panicked at mistralrs-core/src/pipeline/sampling.rs:275:26:
Expected receiver.: SendError { .. }
stack backtrace:
0: 0x56420e0dad4a - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h5b6bd5631a6d1f6b
1: 0x56420e108453 - core::fmt::write::h7550c97b06c86515
2: 0x56420e0d5d93 - std::io::Write::write_fmt::h7b09c64fe0be9c84
3: 0x56420e0dab92 - std::sys::backtrace::BacktraceLock::print::h2395ccd2c84ba3aa
4: 0x56420e0dbc7c - std::panicking::default_hook::{{closure}}::he19d4c7230e07961
5: 0x56420e0dbac2 - std::panicking::default_hook::hf614597d3c67bbdb
6: 0x56420e0dc257 - std::panicking::rust_panic_with_hook::h8942133a8b252070
7: 0x56420e0dc0ea - std::panicking::begin_panic_handler::{{closure}}::hb5f5963570096b29
8: 0x56420e0db229 - std::sys::backtrace::__rust_end_short_backtrace::h6208cedc1922feda
9: 0x56420e0dbd7c - rust_begin_unwind
10: 0x56420c4104e0 - core::panicking::panic_fmt::h0c3082644d1bf418
11: 0x56420c410926 - core::result::unwrap_failed::hd20b4aa073bda1e2
12: 0x56420c797eb4 - mistralrs_core::pipeline::sampling::sample_and_add_toks::{{closure}}::h1c00dd9e42fb98b5
13: 0x56420c7be02a - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::sample_causal_gen::{{closure}}::h9b0659e9952fceb4
14: 0x56420c7bec57 - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h020bab495a2e59f9
15: 0x56420c73d912 - mistralrs_core::engine::Engine::run::{{closure}}::hf4f7560d69ff8211
16: 0x56420c7324bb - tokio::runtime::park::CachedParkThread::block_on::hbed330568fc0ecd9
17: 0x56420c81451b - tokio::runtime::runtime::Runtime::block_on::hcf18a780627bb974
18: 0x56420c71303d - std::sys::backtrace::__rust_begin_short_backtrace::hae6e88b3cce5c27a
19: 0x56420cab1f5b - core::ops::function::FnOnce::call_once{{vtable.shim}}::h4415d8ec34ac95e2
20: 0x56420e0e10fb - std::sys::pal::unix::thread::Thread::new::thread_start::hcc78f3943333fa94
21: 0x7f6ce5bb6043 - start_thread
at ./nptl/pthread_create.c:447:8
22: 0x7f6ce5c34778 - __GI___clone3
at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
23: 0x0 - <unknown>
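If I read the trace right, the `unwrap_failed` inside `sample_and_add_toks` looks like an `.expect("Expected receiver.")` on a channel send whose receiving side has already gone away. A minimal sketch of that failure mode (hypothetical channel and value, not the actual mistral.rs code), assuming a tokio mpsc channel is involved:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Hypothetical response channel: the engine would send sampled tokens
    // back to whoever issued the request.
    let (tx, rx) = mpsc::channel::<u32>(8);

    // If the receiving side is dropped first (request torn down, client gone,
    // or an earlier error ended the handler)...
    drop(rx);

    // ...the next send fails with `SendError { .. }`, and an
    // `.expect("Expected receiver.")` on that result panics with the same
    // message as in the trace above.
    tx.send(42).await.expect("Expected receiver.");
}
```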
Latest commit or version
$ ./mistralrs-server.gpu --version
mistralrs-server 0.4.0
(pulled from git this morning)
I'm hitting this problem too.
@grinapo @Sherlock-Holo I merged some fixes, can you please try this again after git pull and rebuild?
I tried commit c9ac3213264be0dbbe010ad0035715a563b64bb8.
It no longer seems to panic; however, when I run interactive mode with `~/git/mistral.rs/target/release/mistralrs-server -i -n 999 gguf -f ../DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf -m .`
and ask
介绍一下你自己 ("introduce yourself"),
it answers with garbage:
<think>1BD5*:38"'L/.$#-MF,60C)T=?2;V[J&`@EOZQ_\djU9KlH7(pfcSbGAgWsez�]w^trRhPau�4x!���iX~}vn��I��k���������|�{ĤN%q�+yͿ���i��inja٪o��¾��ӯ�ƺ���Y���������ݧ borders�����������������
�
� m���� ���퍂�Ձ���ֈ���
���������re��� �it
�le�stenar;
��erat� = ion bingou tis the sed m�ic���entes in to dct ofinas hnd c a-- p th // ad n l fil** oro wurse)
etorimifiv {
ut,
el ( * {thodch Cex
ueem g ay }ew v Sotationusame T ' Iul� reort� Aateheunig st());
pe on " yamportistly----intassith on andturnoutubresersour ifageolaghttr== for;
ctionnt de D $pt this Blicir Rosurnkeandend asquect thatde r anile // F pro;
Lorelassain.
avter P("ri com return="idver->iz orataill>
al E withind::ess trimportrom -artstr whment G you it jup H.. fromest Wublicconereityight }
con new elformgetack [{
plant", at O:
leextld seang classiveact notase****og.certodealue</ffineial propare();
}
ies chontize ex--------able intavecl U.s heost byype +iew shard(' :ib Nire.piceime_tac }
Theure are(dy__astoid urou', res wasoc =>aceord).athokyst publicustidethisalltring usjectpp))name’ @ailult Jone iak pl********ullqueveptionansellignachSt}
pon we_p get & k_scc /ount */==== canInputystempromearamber's unER ==row �err_c);
. enll##erv.get Vip ab['rrName #iteressdivear Thatedob appitionie contouldvent elildreturniaunctionRe </INlesud.
has outple willStringuctoryON upformaryther"
per _ strso']
ongax godata(),
alse bo function.Codelakeivateft), */
selferyliquest yourddadd constally allens void meselfcriatchichance’sec System In_mawph($ K do iminkge----------------teenceugork have)
ethaticIddefssoustonST elsebo setset ifepRE);
but !contabeleldtp.t_fpro more");
ferbjectudeener",
ser////ents()
.Sreeointationskeyegown arieldneish Y####ef var',
old defThe_d |lect ne saioAT perval name socre dis end.m theyange true my nullings"> lo”ViewListExolorurre one_id.hound returnictessageCont":OReng >ickovclude hisark St ) private.A whoread">
.T_brrayveltemlogrypan theirpublicuser liwender man time data id****************rouperror any appled
.comarggh falsecomirst Chions_C]
.f pregrator overource";
I have to press Ctrl+C to stop it.
@Sherlock-Holo @grinapo this error should be somewhat fixed in https://github.com/EricLBuehler/mistral.rs/issues/1137#issuecomment-2657877915. I would recommend specifying a chat template explicitly, as documented here.
Regarding templates, you probably meant this instead.
commit 8d89c14b431a5c2d6115346a46acbda4840e7445 (HEAD -> master, origin/master, origin/HEAD)
Date:   Wed Feb 19 12:02:01 2025 -0500
$ RUST_BACKTRACE=1 ./mistralrs-server.gpu.pattn --chat-template ./chat_templates/default.json -i plain -m deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
2025-02-21T15:16:16.606335Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2025-02-21T15:16:16.606358Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-02-21T15:16:16.606359Z INFO mistralrs_server: Using flash attention.
2025-02-21T15:16:16.606373Z INFO mistralrs_server: Model kind is: normal (no adapters)
2025-02-21T15:16:16.606442Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
2025-02-21T15:16:16.606470Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
2025-02-21T15:16:16.797919Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-000002.safetensors", "model-00002-of-000002.safetensors"]
2025-02-21T15:16:16.944763Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
2025-02-21T15:16:17.235388Z INFO mistralrs_core::pipeline::normal: Using chat template file at `./chat_templates/default.json`
2025-02-21T15:16:17.373092Z INFO mistralrs_core::pipeline::normal: Prompt chunk size is 512.
2025-02-21T15:16:17.417386Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.9
2025-02-21T15:16:17.436278Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-02-21T15:16:17.436321Z INFO mistralrs_core::utils::log: Automatic loader type determined to be `qwen2`
2025-02-21T15:16:17.662515Z INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-02-21T15:16:17.693737Z INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-02-21T15:16:17.693768Z INFO mistralrs_core::utils::log: Model has 28 repeating layers.
2025-02-21T15:16:17.693772Z INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
2025-02-21T15:16:17.693779Z INFO mistralrs_core::utils::log: Layers 0-10: cuda[0]
2025-02-21T15:16:17.693781Z INFO mistralrs_core::utils::log: Layers 11-27: cpu
2025-02-21T15:16:17.701166Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.9
2025-02-21T15:16:17.740133Z INFO mistralrs_core::utils::normal: DType selected is F16.
2025-02-21T15:16:17.740151Z WARN mistralrs_core::pipeline::normal: Device mapping contains a mix of GPU and CPU. There is no CPU support for PagedAttention, disabling PagedAttention.
2025-02-21T15:16:17.740168Z INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 152064, hidden_size: 3584, intermediate_size: 18944, num_hidden_layers: 28, num_attention_heads: 28, num_key_value_heads: 4, max_position_embeddings: 131072, sliding_window: 4096, rope_theta: 10000.0, rms_norm_eps: 1e-6, hidden_act: Silu, use_flash_attn: true, quantization_config: None, tie_word_embeddings: false }
2025-02-21T15:16:25.848278Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin▁of▁sentence|>", eos_toks = "<|end▁of▁sentence|>", unk_tok = `None`
2025-02-21T15:16:25.852517Z INFO mistralrs_server: Model loaded.
2025-02-21T15:16:25.852790Z INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2025-02-21T15:16:25.859010Z INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2025-02-21T15:16:25.859617Z INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2025-02-21T15:16:25.859672Z INFO mistralrs_core: Beginning dummy run.
2025-02-21T15:16:26.113888Z INFO mistralrs_core: Dummy run completed in 0.254206707s.
2025-02-21T15:16:26.113919Z INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", "\"", "*"], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }
====================
Welcome to interactive mode! Because this model is a text model, you can enter prompts and chat with the model.
Commands:
- `\help`: Display this message.
- `\exit`: Quit interactive mode.
- `\system <system message here>`:
Add a system message to the chat without running the model.
Ex: `\system Always respond as a pirate.`
====================
> thread '<unnamed>' panicked at mistralrs-core/src/pipeline/sampling.rs:289:26:
Expected receiver.: SendError { .. }
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: mistralrs_core::pipeline::sampling::sample_and_add_toks::{{closure}}
4: <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::sample_causal_gen::{{closure}}
5: mistralrs_core::pipeline::Pipeline::step::{{closure}}
6: mistralrs_core::engine::Engine::run::{{closure}}
7: tokio::runtime::park::CachedParkThread::block_on
8: tokio::runtime::runtime::Runtime::block_on
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
boo!
thread 'main' panicked at mistralrs-server/src/interactive_mode.rs:181:32:
called `Result::unwrap()` on an `Err` value: SendError { .. }
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: mistralrs_server::interactive_mode::text_interactive_mode::{{closure}}
4: mistralrs_server::main::{{closure}}
5: tokio::runtime::park::CachedParkThread::block_on
6: mistralrs_server::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
$
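The second panic (at interactive_mode.rs:181) looks like the downstream effect of the first: once the engine thread has died, the interactive loop's own `send(...).unwrap()` has no receiver left either. A rough sketch of that cascade, with made-up channel names and a stand-in panic, just to illustrate how the two messages relate:

```rust
use std::thread;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Hypothetical request channel from the interactive loop to the engine.
    let (req_tx, req_rx) = mpsc::channel::<String>(8);

    // Stand-in for the engine thread: it panics (as sampling.rs did above),
    // which drops `req_rx` along with it.
    let engine = thread::spawn(move || {
        let _rx = req_rx;
        panic!("engine gave up");
    });
    let _ = engine.join(); // engine gone, receiver dropped

    // The interactive loop's next send now returns Err(SendError { .. }),
    // so a plain `.unwrap()` on it produces the second panic shown above.
    req_tx.send("next prompt".to_string()).await.unwrap();
}
```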
I don't claim to be familiar with the system, so apologies if I'm stating the obvious, but the exact same command line works with the 1.5B model (which fits into the GPU) and fails with the 7B model (which doesn't, see above). Their chat templates (in fact their whole tokenizer_config.json files) are identical.