llama : support Jamba hybrid Transformer-Mamba models
This adds support for Jamba (https://arxiv.org/abs/2403.19887), a hybrid Transformer-Mamba model. Fixes #6372.
To complement `llama_kv_cache`, I propose to add `llama_rs_cache`, as well as a top-level `llama_past` to more easily manage both at once.
The current implementation of recurrent states (initially written for Mamba) re-uses the tensors allocated for the KV cache to store them. Obviously, when both Attention and recurrent states are used at the same time, this previous approach does not work.
Note that since this uses some of the same operators as Mamba, this is CPU-only for now (see #6758).
API changes
Most of the changes are backward-compatible, but the `llama_kv_cache_seq_rm` and `llama_kv_cache_seq_cp` functions have been renamed and now return the position of the next token after the end of the sequence(s) they affect.
This is necessary to properly handle recurrent state checkpoints with models that also use the KV cache (like Jamba, and eventually Griffin), in case the last valid state doesn't line up with the requested removal when using, for example, `llama_past_seq_rm`.
- Deprecate most `llama_kv_cache_*` functions to rename them to `llama_past_*`.
  - Not strictly necessary, but since the return type and meaning changed, it might be easier to migrate existing code-bases with backward-compatible wrappers for any differing return behavior.
  - It's no longer only a KV cache, so removing `_kv_` from the names could make them less confusing when working with pure or mixed recurrent models.
    - I'm open to suggestions for a better name prefix! I think `llama_past_*` might be a bit too different from the previous name.
  - I could also re-use the old names, but this will be a breaking change because of the change in meaning of the return type of `llama_kv_cache_seq_rm`. It would also be confusing to figure out at a glance which functions are specific to the KV cache and which are specific to the recurrent state cache.
- `llama_past_seq_rm` and `llama_past_seq_cp` now return `n_past`, which is the number of previous tokens in the affected sequence (it can also be interpreted as the position of the next token at the end of the sequence).
  - This should be handled by processing the tokens again from this point until the desired end (see the sketch after this list).
  - Note that nothing needs to be handled when using these functions on whole sequences (i.e. when `-1` is passed to both `p0` and `p1`).
- `llama_past_seq_max_pos` returns `-1` when there are no cells matching the specified `seq_id`, to allow calculating `n_past` by adding one to its result. `llama_kv_cache_seq_max_pos` previously returned `0` in this case, which made it indistinguishable from when there's a single cell with pos `0`.
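As an illustration of how a caller could handle the returned `n_past`, here is a minimal C++ sketch against the C API of this branch. The helper name `trim_and_reprocess` and its parameters are invented, and it assumes `llama_past_seq_rm(ctx, seq_id, p0, p1)` keeps the old parameter order and returns the new `n_past` as a `llama_pos`:

```cpp
#include "llama.h"

// Hypothetical helper (not part of this PR): trim a sequence back to keep_n
// tokens, then re-process whatever the cache could not keep, based on the
// n_past value returned by llama_past_seq_rm().
static bool trim_and_reprocess(
        struct llama_context * ctx,
        llama_seq_id           seq_id,
        llama_token          * tokens,   // full token history of this sequence
        llama_pos              keep_n) { // desired number of tokens to keep
    // Remove everything from position keep_n onwards. For recurrent (or hybrid)
    // models the returned n_past can be smaller than keep_n when no state
    // checkpoint lines up with the requested position.
    const llama_pos n_past = llama_past_seq_rm(ctx, seq_id, keep_n, -1);

    if (n_past >= keep_n) {
        return true; // the cache already ends exactly where we want it
    }

    // Re-process the tokens between the last valid state and the desired end
    // (assumes keep_n - n_past fits in n_batch; chunk the range otherwise).
    llama_batch batch = llama_batch_get_one(tokens + n_past, keep_n - n_past, n_past, seq_id);
    return llama_decode(ctx, batch) == 0;
}
```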
New features
- Jamba support
  - The first hybrid Transformer+Mamba model in `llama.cpp`
- State checkpoints for recurrent models
  - Works best when `n_parallel` is at least 3 or 4 times the number of actual users
  - Allows backtracking tokens from the end of the last generation without having to reprocess the whole context
    - Very useful with the `server` example when trimming the stop string
- No longer unnecessarily allocate a KV cache for non-causal models (like BERT)
- Variable GQA
  - GGUF metadata `{model}.attention.head_count_kv` can now also be an array of `int32_t`, one value per layer
  - Layers with `0` KV heads are considered recurrent layers (Mamba, in the case of Jamba); see the sketch after this list
  - This will make proper support of DeciLM possible
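As a rough illustration of the variable-GQA metadata (not the actual loader code; the struct and field names below are invented), a per-layer `head_count_kv` lookup could look like this:

```cpp
// Illustrative only (not the actual llama.cpp hparams code): per-layer
// head_count_kv lookup, where 0 KV heads marks a recurrent (Mamba) layer.
#include <cstdint>
#include <vector>

struct hparams_sketch {
    uint32_t             n_head_kv     = 0;  // single value, used when no per-layer array is given
    std::vector<int32_t> n_head_kv_arr;      // optional per-layer values from {model}.attention.head_count_kv

    int32_t n_head_kv_l(uint32_t il) const {
        if (il < n_head_kv_arr.size()) {
            return n_head_kv_arr[il];
        }
        return (int32_t) n_head_kv;
    }

    // In Jamba, layers with 0 KV heads have no attention KV cache at all;
    // they are the recurrent (Mamba) layers.
    bool is_recurrent(uint32_t il) const {
        return n_head_kv_l(il) == 0;
    }
};
```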
Internal changes
- new struct `llama_rs_cache`, a ring-buffered tree of recurrent states (a rough sketch of the bookkeeping follows this list)
  - might be possible to simplify, but the data structure for recurrent states needs quick access to at least
    - the last state of a sequence (the tail cell)
    - the number of sequences for which a particular cell is the last state
    - how many sequences a cell is part of
    - the number of cells used by a sequence
    - the number of "active" sequences (which use cells they don't share with other sequences)
    - the number of cells used by "shared" sequences (e.g. the system prompt)
- new struct `llama_cache`, which contains both `llama_kv_cache` and `llama_rs_cache`
- simpler Mamba state processing
  - RS cells can be the tail of multiple sequences, which allows
    - one-to-one instead of one-to-many state processing (for the `ggml_ssm_*` operators)
    - `llama_past_seq_cp` doesn't use more RS cells the more sequences there are
  - RS slots are always contiguous, and are transparently defragmented if necessary when chosen.
- new struct `llama_ubatch` for more metadata about sequences
- batches are split with equal-length sequences for recurrent models
  - This allows simplifying the SSM operations
  - But the logits of the split batches have to be re-ordered when directly using `llama_get_logits` to match the old expected output. This is not a problem with `llama_get_logits_ith`, because there was already an indirection with `lctx.output_ids`, which is reused.
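To make the RS-cache bookkeeping above more concrete, here is a rough sketch of the kind of per-cell and per-sequence metadata it needs; the names are illustrative only and do not match the actual `llama_rs_cache` fields:

```cpp
// Rough sketch of the bookkeeping described above; field names are
// illustrative only and do not match the actual llama_rs_cache layout.
#include <cstdint>
#include <set>
#include <vector>

#include "llama.h" // for llama_pos and llama_seq_id

struct rs_cell_sketch {
    llama_pos              pos = -1;    // position of the recurrent state stored in this cell
    std::set<llama_seq_id> seq_ids;     // which sequences this cell is part of
    uint32_t               n_tails = 0; // number of sequences for which this cell is the last state
};

struct rs_seq_meta_sketch {
    int32_t  tail    = -1; // index of the tail cell (last state) of this sequence, -1 if none
    uint32_t n_cells = 0;  // number of cells used by this sequence
};

struct rs_cache_sketch {
    std::vector<rs_cell_sketch>     cells;    // ring buffer of recurrent state cells
    std::vector<rs_seq_meta_sketch> seq_meta; // indexed by seq_id

    uint32_t n_active_seqs  = 0; // "active" sequences (using cells they don't share)
    uint32_t n_shared_cells = 0; // cells used by "shared" sequences (e.g. the system prompt)
};
```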
TODO
- [ ] Find a better prefix than `llama_past_*`. Does anybody have better name suggestions?
  - `llama_cache_*`
  - `llama_past_*`
  - `llama_kv_cache_*`
    - Will be confusing with recurrent models, and doesn't offer the possibility of backward-compatible wrappers if the same name is used
  - `llama_ctx_cache_*`
  - `llama_llm_cache_*`
  - `llama_seq_cache_*`
    - Could work, but would not help with discerning sequence-wise functions from cache-wise functions
  - `llama_tok_cache_*`
  - `llama_comp_cache_*`
  - `llama_past_cache_*`
  - `llama_work_cache_*`
  - `llama_kvrs_cache_*`
  - `llama_causal_cache_*`
  - `llama_context_cache_*`
- [x] session file save and restore
- [ ] handle the returned `n_past` from the `llama_past_*` functions used in the various examples
  - [x] `server`, `main`
  - [ ] `speculative`, `lookup`, `lookahead`
- [ ] add consistency tests (perhaps in `tests/test-llama-past.cpp`)
- [ ] Make the recurrent state checkpoint interval configurable
- [ ] Make the minimum number of recurrent states per client configurable to more than one
Future ideas
- Fairly split the available KV cells among active sequences, similarly to RS cells (a rough sketch of the budget arithmetic is below).
  - This could allow setting `--parallel` to a big value while not unnecessarily limiting the context size of the clients of the `server` if there aren't many. (related to https://github.com/ggerganov/llama.cpp/discussions/4130#discussioncomment-8594987)
- Handle token shift (and Self-Extend?) when finding a slot.
  - This could help with the fair split of KV cells by freeing cells of sequences which use more than their fair share of cells.
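A hypothetical sketch of the "fair split" arithmetic mentioned above; nothing like this exists in the PR, it only illustrates the idea:

```cpp
// Hypothetical sketch of a fair KV-cell budget per sequence; not part of this PR.
#include <algorithm>
#include <cstdint>

// Cap how many KV cells a single sequence may occupy, given the total number
// of cells and the number of currently active sequences.
static uint32_t fair_kv_cell_budget(uint32_t n_kv_cells, uint32_t n_active_seqs) {
    if (n_active_seqs == 0) {
        return n_kv_cells;
    }
    // Each active sequence gets an equal share, but always at least one cell.
    return std::max<uint32_t>(1, n_kv_cells / n_active_seqs);
}
```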
Testing
- [x] Jamba
  - @compilade I only tried https://huggingface.co/pszemraj/jamba-900M-v0.13-KIx2 for now, but it gives promising results!
    - [x] conversion (with `convert-hf-to-gguf.py`)
    - [x] inference (with `main`)
    - [ ] session save and restore
    - [ ] `server` with backtracking
    - [ ] quantization
  - [ ] Official Jamba-v0.1 model https://huggingface.co/ai21labs/Jamba-v0.1
    - @compilade I don't have enough RAM to test this.
- [x] Mamba with `parallel`
  - @compilade Can confirm this continues to work as before.
- [x] Embeddings with BERT
  - @compilade With `bge-small`, gives exactly the same embeddings as on `master`, and now it doesn't unnecessarily allocate the KV cache!
Example output of jamba-900M-v0.13-KIx2:
$ ./bin/main -m /srv/LLMstash/tmp/jamba-900M.bf16.gguf --temp 0 -e -p "I believe the meaning of life is" --repeat-penalty 1.2 --repeat-last-n 256 -c 16384 -n 256
Log start
main: build = 3003 (0fd13e94)
main: built with gcc (GCC) 13.2.0 for x86_64-unknown-linux-gnu
main: seed = 1716594011
llama_model_loader: loaded meta data with 26 key-value pairs and 189 tensors from /srv/LLMstash/tmp/jamba-900M.bf16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = jamba
llama_model_loader: - kv 1: general.name str = jamba-900M-v0.13-KIx2
llama_model_loader: - kv 2: jamba.block_count u32 = 12
llama_model_loader: - kv 3: jamba.context_length u32 = 16384
llama_model_loader: - kv 4: jamba.embedding_length u32 = 1024
llama_model_loader: - kv 5: jamba.feed_forward_length u32 = 4096
llama_model_loader: - kv 6: jamba.attention.head_count u32 = 32
llama_model_loader: - kv 7: jamba.attention.head_count_kv arr[i32,12] = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0]
llama_model_loader: - kv 8: jamba.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 9: jamba.ssm.inner_size u32 = 2048
llama_model_loader: - kv 10: jamba.ssm.state_size u32 = 16
llama_model_loader: - kv 11: jamba.ssm.time_step_rank u32 = 256
llama_model_loader: - kv 12: jamba.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 13: jamba.expert_count u32 = 8
llama_model_loader: - kv 14: jamba.expert_used_count u32 = 2
llama_model_loader: - kv 15: general.file_type u32 = 32
llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 17: tokenizer.ggml.pre str = gpt-2
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,65024] = ["<EOT>", "<META>", "<META_START>", "...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,65024] = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,64739] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "ĠĠ �...
llama_model_loader: - kv 21: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 0
llama_model_loader: - kv 23: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type bf16: 68 tensors
llm_load_vocab: special tokens definition check successful ( 29/65024 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = jamba
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 65024
llm_load_print_meta: n_merges = 64739
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 1024
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 12
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 32
llm_load_print_meta: n_embd_head_v = 32
llm_load_print_meta: n_gqa = 0
llm_load_print_meta: n_embd_k_gqa = 0
llm_load_print_meta: n_embd_v_gqa = 0
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4096
llm_load_print_meta: n_expert = 8
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = -1
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 4
llm_load_print_meta: ssm_d_inner = 2048
llm_load_print_meta: ssm_d_state = 16
llm_load_print_meta: ssm_dt_rank = 256
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = BF16
llm_load_print_meta: model params = 887.66 M
llm_load_print_meta: model size = 1.67 GiB (16.19 BPW)
llm_load_print_meta: general.name = jamba-900M-v0.13-KIx2
llm_load_print_meta: BOS token = 0 '<EOT>'
llm_load_print_meta: EOS token = 0 '<EOT>'
llm_load_print_meta: UNK token = 0 '<EOT>'
llm_load_print_meta: PAD token = 0 '<EOT>'
llm_load_print_meta: LF token = 133 'Ä'
llm_load_tensors: ggml ctx size = 0.09 MiB
llm_load_tensors: CPU buffer size = 1713.16 MiB
......................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_cache_init: CPU cache buf size = 49.34 MiB
llama_new_context_with_model: SSM state size = 1.34 MiB, R (f32): 0.21 MiB, S (f32): 1.12 MiB
llama_new_context_with_model: KV cache size = 48.00 MiB, K (f16): 24.00 MiB, V (f16): 24.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.25 MiB
llama_new_context_with_model: CPU compute buffer size = 1062.03 MiB
llama_new_context_with_model: graph nodes = 621
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 2 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 256, repeat_penalty = 1.200, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 16384, n_batch = 2048, n_predict = 256, n_keep = 0
<EOT>I believe the meaning of life is not to be found in a single word, but rather as an expression of one's own feelings and thoughts.
The idea that we are all born with our bodies, whether they are human or animal, has been around for centuries. It was believed by some that it was something like a body made up of bones, which were attached to each other at birth. The most common form of this type of bone is called a "bone." This is what makes it so hard to tell if you're alive or dead. In fact, there are many different types of bones, including those that have been used for various purposes such as healing wounds, wounding wounds, etc.
In ancient times, people had a lot of teeth, and these were often very small. They could also be placed on top of their heads, where they would sit down and look at them. These were usually large, round stones, which were sometimes covered with hair. When the skin was removed from the head, the bones became more prominent, and the muscles began to grow larger.
This kind of bone was known as a "bone" because it was made out of two parts: the outermost part (the innermost portion) and the innermost part (the outermost
llama_print_timings: load time = 252.28 ms
llama_print_timings: sample time = 303.07 ms / 256 runs ( 1.18 ms per token, 844.68 tokens per second)
llama_print_timings: prompt eval time = 200.72 ms / 8 tokens ( 25.09 ms per token, 39.86 tokens per second)
llama_print_timings: eval time = 12516.79 ms / 255 runs ( 49.09 ms per token, 20.37 tokens per second)
llama_print_timings: total time = 13213.95 ms / 263 tokens
Log end
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 557 iterations 🚀
- Concurrent users: 8, duration: 10m
- HTTP request : avg=8384.34ms p(95)=20451.68ms fails=, finish reason: stop=510 truncated=47
- Prompt processing (pp): avg=102.96tk/s p(95)=478.95tk/s
- Token generation (tg): avg=36.48tk/s p(95)=48.13tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=compilade/refactor-kv-cache commit=fee3c1d740c0e027c81e2f2f3fb48d619857175f
[chart: llamacpp:prompt_tokens_seconds, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 557 iterations]
[chart: llamacpp:predicted_tokens_seconds, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 557 iterations]
[chart: llamacpp:kv_cache_usage_ratio, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 557 iterations]
[chart: llamacpp:requests_processing, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 557 iterations]
Great job! Works for me too, it's very fast. There were some warnings during compilation, but nothing major.
<EOT>Hello!
I'll get a new one for you and I think this is going to be really cool, so good. And I'm sure there's lots of ways in which [...]
llama_print_timings: load time = 286.42 ms
llama_print_timings: sample time = 155.94 ms / 256 runs ( 0.61 ms per token, 1641.63 tokens per second)
llama_print_timings: prompt eval time = 70.77 ms / 3 tokens ( 23.59 ms per token, 42.39 tokens per second)
llama_print_timings: eval time = 9368.54 ms / 255 runs ( 36.74 ms per token, 27.22 tokens per second)
llama_print_timings: total time = 9686.16 ms / 258 tokens
Amazing work! I initially tested Jamba-v0.1 on a machine with 500G RAM and it worked great!
./main -m ./Jamba-v0.1-hf-00001-of-00024.gguf -n 120 --prompt "def max(arr):" --temp 0
Log start
main: build = 3006 (fc59407e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1716710334
llama_model_loader: additional 23 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 31 key-value pairs and 531 tensors from ./Jamba-v0.1-hf-00001-of-00024.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = jamba
llama_model_loader: - kv 1: general.name str = Jamba-v0.1-hf
llama_model_loader: - kv 2: jamba.block_count u32 = 32
llama_model_loader: - kv 3: jamba.context_length u32 = 262144
llama_model_loader: - kv 4: jamba.embedding_length u32 = 4096
llama_model_loader: - kv 5: jamba.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: jamba.attention.head_count u32 = 32
llama_model_loader: - kv 7: jamba.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, ...
llama_model_loader: - kv 8: jamba.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 9: jamba.ssm.inner_size u32 = 8192
llama_model_loader: - kv 10: jamba.ssm.state_size u32 = 16
llama_model_loader: - kv 11: jamba.ssm.time_step_rank u32 = 256
llama_model_loader: - kv 12: jamba.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 13: jamba.expert_count u32 = 16
llama_model_loader: - kv 14: jamba.expert_used_count u32 = 2
llama_model_loader: - kv 15: general.file_type u32 = 32
llama_model_loader: - kv 16: tokenizer.ggml.model str = llama
llama_model_loader: - kv 17: tokenizer.ggml.pre str = default
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,65536] = ["<|pad|>", "<|startoftext|>", "<|end...
llama_model_loader: - kv 19: tokenizer.ggml.scores arr[f32,65536] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,65536] = [3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 21: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 23: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 25: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 26: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: split.no u16 = 0
llama_model_loader: - kv 29: split.count u16 = 24
llama_model_loader: - kv 30: split.tensors.count i32 = 531
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type bf16: 170 tensors
llm_load_vocab: special tokens definition check successful ( 1799/65536 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = jamba
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 65536
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 262144
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 0
llm_load_print_meta: n_embd_k_gqa = 0
llm_load_print_meta: n_embd_v_gqa = 0
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 16
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = -1
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 262144
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 4
llm_load_print_meta: ssm_d_inner = 8192
llm_load_print_meta: ssm_d_state = 16
llm_load_print_meta: ssm_dt_rank = 256
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = BF16
llm_load_print_meta: model params = 51.57 B
llm_load_print_meta: model size = 96.30 GiB (16.04 BPW)
llm_load_print_meta: general.name = Jamba-v0.1-hf
llm_load_print_meta: BOS token = 1 '<|startoftext|>'
llm_load_print_meta: EOS token = 2 '<|endoftext|>'
llm_load_print_meta: UNK token = 3 '<|unk|>'
llm_load_print_meta: PAD token = 0 '<|pad|>'
llm_load_print_meta: LF token = 1554 '<0x0A>'
llm_load_print_meta: EOT token = 2 '<|endoftext|>'
llm_load_tensors: ggml ctx size = 0.24 MiB
llm_load_tensors: CPU buffer size = 4851.72 MiB
llm_load_tensors: CPU buffer size = 4339.75 MiB
llm_load_tensors: CPU buffer size = 4210.03 MiB
llm_load_tensors: CPU buffer size = 3584.00 MiB
llm_load_tensors: CPU buffer size = 4339.75 MiB
llm_load_tensors: CPU buffer size = 5095.47 MiB
llm_load_tensors: CPU buffer size = 3584.00 MiB
llm_load_tensors: CPU buffer size = 3584.03 MiB
llm_load_tensors: CPU buffer size = 4210.03 MiB
llm_load_tensors: CPU buffer size = 3584.00 MiB
llm_load_tensors: CPU buffer size = 4339.75 MiB
llm_load_tensors: CPU buffer size = 4339.75 MiB
llm_load_tensors: CPU buffer size = 3584.00 MiB
llm_load_tensors: CPU buffer size = 4339.75 MiB
llm_load_tensors: CPU buffer size = 4210.03 MiB
llm_load_tensors: CPU buffer size = 3584.00 MiB
llm_load_tensors: CPU buffer size = 4339.75 MiB
llm_load_tensors: CPU buffer size = 4339.75 MiB
llm_load_tensors: CPU buffer size = 3584.00 MiB
llm_load_tensors: CPU buffer size = 4339.75 MiB
llm_load_tensors: CPU buffer size = 4210.03 MiB
llm_load_tensors: CPU buffer size = 3584.00 MiB
llm_load_tensors: CPU buffer size = 4851.77 MiB
llm_load_tensors: CPU buffer size = 3584.03 MiB
..............................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_cache_init: CPU cache buf size = 24.63 MiB
llama_new_context_with_model: SSM state size = 16.62 MiB, R (f32): 2.62 MiB, S (f32): 14.00 MiB
llama_new_context_with_model: KV cache size = 8.00 MiB, K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.25 MiB
llama_new_context_with_model: CPU compute buffer size = 145.10 MiB
llama_new_context_with_model: graph nodes = 1730
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 32 / 64 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 120, n_keep = 1
<|startoftext|> def max(arr):
return max(arr)
def min(arr):
return min(arr)
def mean(arr):
return sum(arr) / len(arr)
def median(arr):
arr.sort()
if len(arr) % 2 == 0:
return (arr[len(arr) // 2] + arr[len(arr) // 2 - 1]) / 2
else:
return arr[len(arr) // 2]
llama_print_timings: load time = 82494.54 ms
llama_print_timings: sample time = 9.61 ms / 120 runs ( 0.08 ms per token, 12490.89 tokens per second)
llama_print_timings: prompt eval time = 666.33 ms / 6 tokens ( 111.06 ms per token, 9.00 tokens per second)
llama_print_timings: eval time = 27656.31 ms / 119 runs ( 232.41 ms per token, 4.30 tokens per second)
llama_print_timings: total time = 28862.18 ms / 125 tokens
Log end
Jumping in to add some results on trying to quant smaller than Q8_0. So far, any of the standard Jamba models (52B) and the smaller community-built models can be converted to Q8_0 just fine and then loaded for inference. Anytime I try to make a smaller quant I end up with this though. I have a decent rig (M2 Ultra 128GB) so I can keep testing and help to troubleshoot if needed
./quantize /Volumes/Severian/Jamba-v0.1-Claude-Chat-gguf/Jamba-v0.1-Claude-Chat.f16.gguf Jamba-v0.1-Claude-Chat-q6_k.gguf q6_k
main: build = 0 (unknown)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
main: quantizing '/Volumes/Severian/Jamba-v0.1-Claude-Chat-gguf/Jamba-v0.1-Claude-Chat.f16.gguf' to 'Jamba-v0.1-Claude-Chat-q6_k.gguf' as Q6_K
llama_model_loader: loaded meta data with 28 key-value pairs and 531 tensors from /Volumes/Severian/Jamba-v0.1-Claude-Chat-gguf/Jamba-v0.1-Claude-Chat.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = jamba
llama_model_loader: - kv 1: general.name str = Norquinal_Jamba-v0.1-Claude-Chat
llama_model_loader: - kv 2: jamba.block_count u32 = 32
llama_model_loader: - kv 3: jamba.context_length u32 = 262144
llama_model_loader: - kv 4: jamba.embedding_length u32 = 4096
llama_model_loader: - kv 5: jamba.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: jamba.attention.head_count u32 = 32
llama_model_loader: - kv 7: jamba.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, ...
llama_model_loader: - kv 8: jamba.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 9: jamba.ssm.inner_size u32 = 8192
llama_model_loader: - kv 10: jamba.ssm.state_size u32 = 16
llama_model_loader: - kv 11: jamba.ssm.time_step_rank u32 = 256
llama_model_loader: - kv 12: jamba.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 13: jamba.expert_count u32 = 16
llama_model_loader: - kv 14: jamba.expert_used_count u32 = 2
llama_model_loader: - kv 15: general.file_type u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.model str = llama
llama_model_loader: - kv 17: tokenizer.ggml.pre str = default
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,65536] = ["<|pad|>", "<|startoftext|>", "<|end...
llama_model_loader: - kv 19: tokenizer.ggml.scores arr[f32,65536] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,65536] = [3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 21: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 23: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 26: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type f16: 170 tensors
GGML_ASSERT: llama.cpp:16297: (qs.n_attention_wv == 0 || qs.n_attention_wv == (int)model.hparams.n_layer) && "n_attention_wv is unexpected"
zsh: abort ./quantize Jamba-v0.1-Claude-Chat-q6_k.gguf q6_k
@compilade I tried out the latest changes and the further quant into q6_k works! I haven't tried the others but will do so and let you know
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
main: quantizing '/Volumes/Severian/Jamba-v0.1-Claude-Chat-gguf/Jamba-v0.1-Claude-Chat.f16.gguf' to 'Jamba-v0.1-Claude-Chat-q6_k.gguf' as Q6_K
llama_model_loader: loaded meta data with 28 key-value pairs and 531 tensors from /Volumes/Severian/Jamba-v0.1-Claude-Chat-gguf/Jamba-v0.1-Claude-Chat.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = jamba
llama_model_loader: - kv 1: general.name str = Norquinal_Jamba-v0.1-Claude-Chat
llama_model_loader: - kv 2: jamba.block_count u32 = 32
llama_model_loader: - kv 3: jamba.context_length u32 = 262144
llama_model_loader: - kv 4: jamba.embedding_length u32 = 4096
llama_model_loader: - kv 5: jamba.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: jamba.attention.head_count u32 = 32
llama_model_loader: - kv 7: jamba.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, ...
llama_model_loader: - kv 8: jamba.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 9: jamba.ssm.inner_size u32 = 8192
llama_model_loader: - kv 10: jamba.ssm.state_size u32 = 16
llama_model_loader: - kv 11: jamba.ssm.time_step_rank u32 = 256
llama_model_loader: - kv 12: jamba.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 13: jamba.expert_count u32 = 16
llama_model_loader: - kv 14: jamba.expert_used_count u32 = 2
llama_model_loader: - kv 15: general.file_type u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.model str = llama
llama_model_loader: - kv 17: tokenizer.ggml.pre str = default
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,65536] = ["<|pad|>", "<|startoftext|>", "<|end...
llama_model_loader: - kv 19: tokenizer.ggml.scores arr[f32,65536] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,65536] = [3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 21: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 23: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 26: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type f16: 170 tensors
[ 1/ 531] token_embd.weight - [ 4096, 65536, 1, 1], type = f16, converting to q6_K .. size = 512.00 MiB -> 210.00 MiB
[ 2/ 531] blk.0.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 3/ 531] blk.0.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 4/ 531] blk.0.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 5/ 531] blk.0.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 6/ 531] blk.0.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 7/ 531] blk.0.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 8/ 531] blk.0.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 9/ 531] blk.0.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 10/ 531] blk.0.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 11/ 531] blk.0.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 12/ 531] blk.0.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 13/ 531] blk.0.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 14/ 531] blk.0.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 15/ 531] blk.0.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 16/ 531] blk.0.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 17/ 531] blk.0.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 18/ 531] blk.0.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 19/ 531] blk.1.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 20/ 531] blk.1.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 21/ 531] blk.1.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 22/ 531] blk.1.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 23/ 531] blk.1.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 24/ 531] blk.1.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 25/ 531] blk.1.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 26/ 531] blk.1.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 27/ 531] blk.1.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 28/ 531] blk.1.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 29/ 531] blk.1.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 30/ 531] blk.1.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 31/ 531] blk.1.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 32/ 531] blk.1.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 33/ 531] blk.1.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 34/ 531] blk.1.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 35/ 531] blk.1.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 36/ 531] blk.1.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 37/ 531] blk.2.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 38/ 531] blk.2.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 39/ 531] blk.2.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 40/ 531] blk.2.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 41/ 531] blk.2.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 42/ 531] blk.2.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 43/ 531] blk.2.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 44/ 531] blk.2.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 45/ 531] blk.2.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 46/ 531] blk.2.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 47/ 531] blk.2.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 48/ 531] blk.2.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 49/ 531] blk.2.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 50/ 531] blk.2.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 51/ 531] blk.2.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 52/ 531] blk.2.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 53/ 531] blk.2.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 54/ 531] blk.3.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 55/ 531] blk.3.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 56/ 531] blk.3.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 57/ 531] blk.3.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 58/ 531] blk.3.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 59/ 531] blk.3.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 60/ 531] blk.3.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 61/ 531] blk.3.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 62/ 531] blk.3.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 63/ 531] blk.3.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 64/ 531] blk.3.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 65/ 531] blk.3.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 66/ 531] blk.3.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 67/ 531] blk.3.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 68/ 531] blk.3.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 69/ 531] blk.3.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 70/ 531] blk.3.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 71/ 531] blk.3.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 72/ 531] blk.4.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 73/ 531] blk.4.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 74/ 531] blk.4.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 75/ 531] blk.4.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 76/ 531] blk.4.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 77/ 531] blk.4.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q6_K .. size = 8.00 MiB -> 3.28 MiB
[ 78/ 531] blk.4.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q6_K .. size = 32.00 MiB -> 13.12 MiB
[ 79/ 531] blk.4.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q6_K .. size = 32.00 MiB -> 13.12 MiB
[ 80/ 531] blk.4.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q6_K .. size = 8.00 MiB -> 3.28 MiB
[ 81/ 531] blk.5.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 82/ 531] blk.5.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 83/ 531] blk.5.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 84/ 531] blk.5.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 85/ 531] blk.5.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 86/ 531] blk.5.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 87/ 531] blk.5.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 88/ 531] blk.5.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 89/ 531] blk.5.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 90/ 531] blk.5.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 91/ 531] blk.5.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 92/ 531] blk.5.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 93/ 531] blk.5.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 94/ 531] blk.5.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 95/ 531] blk.5.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 96/ 531] blk.5.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 97/ 531] blk.5.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 98/ 531] blk.5.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 99/ 531] blk.6.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 100/ 531] blk.6.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 101/ 531] blk.6.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 102/ 531] blk.6.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 103/ 531] blk.6.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 104/ 531] blk.6.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 105/ 531] blk.6.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 106/ 531] blk.6.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 107/ 531] blk.6.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 108/ 531] blk.6.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 109/ 531] blk.6.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 110/ 531] blk.6.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 111/ 531] blk.6.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 112/ 531] blk.6.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 113/ 531] blk.6.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 114/ 531] blk.6.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 115/ 531] blk.6.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 116/ 531] blk.7.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 117/ 531] blk.7.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 118/ 531] blk.7.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 119/ 531] blk.7.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 120/ 531] blk.7.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 121/ 531] blk.7.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 122/ 531] blk.7.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 123/ 531] blk.7.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 124/ 531] blk.7.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 125/ 531] blk.7.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 126/ 531] blk.7.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 127/ 531] blk.7.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 128/ 531] blk.7.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 129/ 531] blk.7.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 130/ 531] blk.7.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 131/ 531] blk.7.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 132/ 531] blk.7.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 133/ 531] blk.7.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 134/ 531] blk.8.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 135/ 531] blk.8.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 136/ 531] blk.8.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 137/ 531] blk.8.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 138/ 531] blk.8.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 139/ 531] blk.8.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 140/ 531] blk.8.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 141/ 531] blk.8.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 142/ 531] blk.8.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 143/ 531] blk.8.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 144/ 531] blk.8.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 145/ 531] blk.8.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 146/ 531] blk.8.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 147/ 531] blk.8.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 148/ 531] blk.8.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 149/ 531] blk.8.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 150/ 531] blk.8.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 151/ 531] blk.9.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 152/ 531] blk.9.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 153/ 531] blk.9.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 154/ 531] blk.9.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 155/ 531] blk.9.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 156/ 531] blk.9.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 157/ 531] blk.9.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 158/ 531] blk.9.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 159/ 531] blk.9.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 160/ 531] blk.9.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 161/ 531] blk.9.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 162/ 531] blk.9.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 163/ 531] blk.9.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 164/ 531] blk.10.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 165/ 531] blk.10.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 166/ 531] blk.10.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 167/ 531] blk.10.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 168/ 531] blk.10.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 169/ 531] blk.10.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 170/ 531] blk.10.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 171/ 531] blk.10.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 172/ 531] blk.10.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 173/ 531] blk.10.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 174/ 531] blk.10.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 175/ 531] blk.10.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 176/ 531] blk.10.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 177/ 531] blk.10.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 178/ 531] blk.10.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 179/ 531] blk.10.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 180/ 531] blk.10.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 181/ 531] blk.11.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 182/ 531] blk.11.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 183/ 531] blk.11.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 184/ 531] blk.11.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 185/ 531] blk.11.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 186/ 531] blk.11.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 187/ 531] blk.11.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 188/ 531] blk.11.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 189/ 531] blk.11.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 190/ 531] blk.11.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 191/ 531] blk.11.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 192/ 531] blk.11.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 193/ 531] blk.11.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 194/ 531] blk.9.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 195/ 531] blk.9.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 196/ 531] blk.9.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 197/ 531] blk.9.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 198/ 531] blk.9.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 199/ 531] blk.11.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 200/ 531] blk.11.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 201/ 531] blk.11.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 202/ 531] blk.11.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 203/ 531] blk.11.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 204/ 531] blk.12.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 205/ 531] blk.12.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 206/ 531] blk.12.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 207/ 531] blk.12.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 208/ 531] blk.12.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 209/ 531] blk.12.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q6_K .. size = 8.00 MiB -> 3.28 MiB
[ 210/ 531] blk.12.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q6_K .. size = 32.00 MiB -> 13.12 MiB
[ 211/ 531] blk.12.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q6_K .. size = 32.00 MiB -> 13.12 MiB
[ 212/ 531] blk.12.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q6_K .. size = 8.00 MiB -> 3.28 MiB
[ 213/ 531] blk.13.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 214/ 531] blk.13.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 215/ 531] blk.13.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 216/ 531] blk.13.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 217/ 531] blk.13.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 218/ 531] blk.13.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 219/ 531] blk.13.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 220/ 531] blk.13.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 221/ 531] blk.13.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 222/ 531] blk.13.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 223/ 531] blk.13.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 224/ 531] blk.13.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 225/ 531] blk.13.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 226/ 531] blk.13.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 227/ 531] blk.13.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 228/ 531] blk.13.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 229/ 531] blk.13.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 230/ 531] blk.13.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 231/ 531] blk.14.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 232/ 531] blk.14.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 233/ 531] blk.14.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 234/ 531] blk.14.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 235/ 531] blk.14.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 236/ 531] blk.14.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 237/ 531] blk.14.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 238/ 531] blk.14.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 239/ 531] blk.14.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 240/ 531] blk.14.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 241/ 531] blk.14.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 242/ 531] blk.14.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 243/ 531] blk.14.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 244/ 531] blk.14.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 245/ 531] blk.14.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 246/ 531] blk.14.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 247/ 531] blk.14.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 248/ 531] blk.15.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 249/ 531] blk.15.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 250/ 531] blk.15.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 251/ 531] blk.15.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 252/ 531] blk.15.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 253/ 531] blk.15.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 254/ 531] blk.15.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 255/ 531] blk.15.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 256/ 531] blk.15.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 257/ 531] blk.15.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 258/ 531] blk.15.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 259/ 531] blk.15.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 260/ 531] blk.15.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 261/ 531] blk.15.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 262/ 531] blk.15.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 263/ 531] blk.15.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 264/ 531] blk.15.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 265/ 531] blk.15.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 266/ 531] blk.16.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 267/ 531] blk.16.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 268/ 531] blk.16.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 269/ 531] blk.16.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 270/ 531] blk.16.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 271/ 531] blk.16.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 272/ 531] blk.16.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 273/ 531] blk.16.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 274/ 531] blk.16.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 275/ 531] blk.16.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 276/ 531] blk.16.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 277/ 531] blk.16.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 278/ 531] blk.16.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 279/ 531] blk.16.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 280/ 531] blk.16.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 281/ 531] blk.16.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 282/ 531] blk.16.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 283/ 531] blk.17.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 284/ 531] blk.17.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 285/ 531] blk.17.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 286/ 531] blk.17.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 287/ 531] blk.17.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 288/ 531] blk.17.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 289/ 531] blk.17.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 290/ 531] blk.17.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 291/ 531] blk.17.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 292/ 531] blk.17.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 293/ 531] blk.17.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 294/ 531] blk.17.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 295/ 531] blk.17.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 296/ 531]           blk.17.ffn_down_exps.weight - [14336,  4096,    16,     1], type =    f16, converting to q6_K .. size =  1792.00 MiB ->   735.00 MiB
[ 297/ 531] blk.17.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 298/ 531] blk.17.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 299/ 531] blk.17.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 300/ 531] blk.17.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 301/ 531] blk.18.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 302/ 531] blk.18.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 303/ 531] blk.18.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 304/ 531] blk.18.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 305/ 531] blk.18.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 306/ 531] blk.18.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 307/ 531] blk.18.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 308/ 531] blk.18.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 309/ 531] blk.18.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 310/ 531] blk.18.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 311/ 531] blk.18.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 312/ 531] blk.18.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 313/ 531] blk.18.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 314/ 531] blk.18.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 315/ 531] blk.18.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 316/ 531] blk.18.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 317/ 531] blk.18.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 318/ 531] blk.19.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 319/ 531] blk.19.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 320/ 531] blk.19.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 321/ 531] blk.19.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 322/ 531] blk.19.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 323/ 531] blk.19.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 324/ 531] blk.19.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 325/ 531] blk.19.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 326/ 531] blk.19.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 327/ 531] blk.19.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 328/ 531] blk.19.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 329/ 531] blk.19.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 330/ 531] blk.19.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 331/ 531] blk.19.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 332/ 531] blk.19.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 333/ 531] blk.19.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 334/ 531] blk.19.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 335/ 531] blk.19.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 336/ 531] blk.20.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 337/ 531] blk.20.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 338/ 531] blk.20.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 339/ 531] blk.20.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 340/ 531] blk.20.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 341/ 531] blk.20.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q6_K .. size = 8.00 MiB -> 3.28 MiB
[ 342/ 531] blk.20.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q6_K .. size = 32.00 MiB -> 13.12 MiB
[ 343/ 531] blk.20.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q6_K .. size = 32.00 MiB -> 13.12 MiB
[ 344/ 531] blk.20.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q6_K .. size = 8.00 MiB -> 3.28 MiB
[ 345/ 531] blk.21.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 346/ 531] blk.21.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 347/ 531] blk.21.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 348/ 531] blk.21.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 349/ 531] blk.21.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 350/ 531] blk.21.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 351/ 531] blk.21.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 352/ 531] blk.21.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 353/ 531] blk.21.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 354/ 531] blk.21.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 355/ 531] blk.21.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 356/ 531] blk.21.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 357/ 531] blk.21.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 358/ 531] blk.21.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 359/ 531] blk.21.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 360/ 531] blk.21.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 361/ 531] blk.21.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 362/ 531] blk.21.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 363/ 531] blk.22.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 364/ 531] blk.22.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 365/ 531] blk.22.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 366/ 531] blk.22.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 367/ 531] blk.22.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 368/ 531] blk.22.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 369/ 531] blk.22.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 370/ 531] blk.22.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 371/ 531] blk.22.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 372/ 531] blk.22.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 373/ 531] blk.22.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 374/ 531] blk.22.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 375/ 531] blk.22.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 376/ 531] blk.22.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 377/ 531] blk.22.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 378/ 531] blk.22.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 379/ 531] blk.22.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 380/ 531] blk.23.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 381/ 531] blk.23.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 382/ 531] blk.23.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 383/ 531] blk.23.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 384/ 531] blk.23.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 385/ 531] blk.23.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 386/ 531] blk.23.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 387/ 531] blk.23.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 388/ 531] blk.23.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 389/ 531] blk.23.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 390/ 531] blk.23.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 391/ 531] blk.23.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 392/ 531] blk.23.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 393/ 531] blk.23.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 394/ 531] blk.23.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 395/ 531] blk.23.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 396/ 531] blk.23.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 397/ 531] blk.23.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 398/ 531] blk.24.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 399/ 531] blk.24.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 400/ 531] blk.24.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 401/ 531] blk.24.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 402/ 531] blk.24.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 403/ 531] blk.24.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 404/ 531] blk.24.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 405/ 531] blk.24.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 406/ 531] blk.24.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 407/ 531] blk.24.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 408/ 531] blk.24.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 409/ 531] blk.24.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 410/ 531] blk.24.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 411/ 531] blk.24.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 412/ 531] blk.24.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 413/ 531] blk.24.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 414/ 531] blk.24.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 415/ 531] blk.25.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 416/ 531] blk.25.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 417/ 531] blk.25.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 418/ 531] blk.25.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 419/ 531] blk.25.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 420/ 531] blk.25.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 421/ 531] blk.25.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 422/ 531] blk.25.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 423/ 531] blk.25.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 424/ 531] blk.25.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 425/ 531] blk.25.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 426/ 531] blk.25.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 427/ 531] blk.25.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 428/ 531] blk.25.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 429/ 531] blk.25.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 430/ 531] blk.25.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 431/ 531] blk.25.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 432/ 531] blk.25.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 433/ 531] blk.26.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 434/ 531] blk.26.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 435/ 531] blk.26.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 436/ 531] blk.26.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 437/ 531] blk.26.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 438/ 531] blk.26.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 439/ 531] blk.26.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 440/ 531] blk.26.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 441/ 531] blk.26.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 442/ 531] blk.26.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 443/ 531] blk.26.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 444/ 531] blk.26.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 445/ 531] blk.26.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 446/ 531] blk.26.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 447/ 531] blk.26.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 448/ 531] blk.26.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 449/ 531] blk.26.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 450/ 531] blk.27.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 451/ 531] blk.27.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 452/ 531] blk.27.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 453/ 531] blk.27.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 454/ 531] blk.27.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 455/ 531] blk.27.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 456/ 531] blk.27.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 457/ 531] blk.27.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 458/ 531] blk.27.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 459/ 531] blk.27.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 460/ 531] blk.27.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 461/ 531] blk.27.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 462/ 531] blk.27.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 463/ 531] blk.27.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 464/ 531] blk.27.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 465/ 531] blk.27.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 466/ 531] blk.27.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 467/ 531] blk.27.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 468/ 531] blk.28.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 469/ 531] blk.28.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 470/ 531] blk.28.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 471/ 531] blk.28.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 472/ 531] blk.28.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 473/ 531] blk.28.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q6_K .. size = 8.00 MiB -> 3.28 MiB
[ 474/ 531] blk.28.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q6_K .. size = 32.00 MiB -> 13.12 MiB
[ 475/ 531] blk.28.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q6_K .. size = 32.00 MiB -> 13.12 MiB
[ 476/ 531] blk.28.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q6_K .. size = 8.00 MiB -> 3.28 MiB
[ 477/ 531] blk.29.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 478/ 531] blk.29.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 479/ 531] blk.29.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 480/ 531] blk.29.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 481/ 531] blk.29.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 482/ 531] blk.29.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 483/ 531] blk.29.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 484/ 531] blk.29.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 485/ 531] blk.29.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 486/ 531] blk.29.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 487/ 531] blk.29.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 488/ 531] blk.29.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 489/ 531] blk.29.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 490/ 531] blk.29.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 491/ 531] blk.29.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 492/ 531] blk.29.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 493/ 531] blk.29.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 494/ 531] blk.29.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 495/ 531] blk.30.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 496/ 531] blk.30.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 497/ 531] blk.30.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 498/ 531] blk.30.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 499/ 531] blk.30.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 500/ 531] blk.30.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 501/ 531] blk.30.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 502/ 531] blk.30.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 503/ 531] blk.30.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 504/ 531] blk.30.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 505/ 531] blk.30.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 506/ 531] blk.30.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 507/ 531] blk.30.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 508/ 531] blk.30.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 509/ 531] blk.30.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 510/ 531] blk.30.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 511/ 531] blk.30.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 512/ 531] blk.31.ffn_gate_inp.weight - [ 4096, 16, 1, 1], type = f32, size = 0.250 MB
[ 513/ 531] blk.31.ssm_a - [ 16, 8192, 1, 1], type = f32, size = 0.500 MB
[ 514/ 531] blk.31.ssm_d - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 515/ 531] blk.31.ssm_b_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 516/ 531] blk.31.ssm_c_norm.weight - [ 16, 1, 1, 1], type = f32, size = 0.000 MB
[ 517/ 531] blk.31.ssm_conv1d.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 518/ 531] blk.31.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MB
[ 519/ 531] blk.31.ssm_dt_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
[ 520/ 531] blk.31.ssm_dt.bias - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 521/ 531] blk.31.ssm_dt.weight - [ 256, 8192, 1, 1], type = f32, size = 8.000 MB
[ 522/ 531] blk.31.ssm_in.weight - [ 4096, 16384, 1, 1], type = f16, converting to q6_K .. size = 128.00 MiB -> 52.50 MiB
[ 523/ 531] blk.31.ssm_out.weight - [ 8192, 4096, 1, 1], type = f16, converting to q6_K .. size = 64.00 MiB -> 26.25 MiB
[ 524/ 531] blk.31.ssm_x.weight - [ 8192, 288, 1, 1], type = f32, size = 9.000 MB
[ 525/ 531] output.weight - [ 4096, 65536, 1, 1], type = f16, converting to q6_K .. size = 512.00 MiB -> 210.00 MiB
[ 526/ 531] output_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 527/ 531] blk.31.ffn_down_exps.weight - [14336, 4096, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 528/ 531] blk.31.ffn_gate_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 529/ 531] blk.31.ffn_up_exps.weight - [ 4096, 14336, 16, 1], type = f16, converting to q6_K .. size = 1792.00 MiB -> 735.00 MiB
[ 530/ 531] blk.31.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 531/ 531] blk.31.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
llama_model_quantize_internal: model size = 98613.17 MB
llama_model_quantize_internal: quant size = 40742.42 MB
main: quantize time = 841429.95 ms
main: total time = 841429.95 ms
I'm currently working on a big refactor of how Mamba (and Jamba) works, to make all sequences in a sub-batch the same length (initially only for models with recurrent states) and to make recurrent state slots contiguous, with the goal of simplifying the SSM operations ~~(and removing GGML_OP_SSM_CONV)~~ (EDIT: didn't yet find a memory-efficient way to get rid of it), so that GPU support will be much easier to implement afterwards.
It will also remove unnecessary complexity related to SSM ops like inp_s_seq.
Right now I've got working equal-sequence-length sub-batch splitting, I've started simplifying the SSM operations, and I'm working on a simpler way to allocate recurrent state slots. Not yet pushed here, because it's very WIP. It's going to take at least a few days, but I think it's worth it.
I've pushed the refactor to use equal-sequence-length sub-batch splitting for recurrent models. This greatly simplifies the SSM operations, no need for inp_s_seq anymore. And recurrent state slot allocation is now always contiguous, defragmenting itself transparently. This is mostly internal changes, nothing user-facing should really have changed.
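To make the equal-sequence-length splitting easier to picture, here is a minimal self-contained sketch of the idea. It is not the actual llama.cpp splitting code, and the `toy_*` names and the `split_equal` helper are made up: each ubatch takes the same number of tokens from every sequence it contains, bounded by the shortest remaining sequence and the ubatch size budget.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

struct toy_token  { int32_t id; int32_t pos; int32_t seq_id; };
struct toy_ubatch { std::vector<toy_token> tokens; size_t n_seqs; size_t n_seq_tokens; };

// Split a batch so that within each ubatch every sequence contributes exactly
// n_seq_tokens tokens; the SSM ops can then view the ubatch as a dense
// [n_seqs, n_seq_tokens] block without a per-token sequence map.
static std::vector<toy_ubatch> split_equal(std::vector<toy_token> batch, size_t n_ubatch) {
    std::map<int32_t, std::vector<toy_token>> per_seq; // tokens grouped by seq_id, order kept
    for (const auto & t : batch) per_seq[t.seq_id].push_back(t);

    std::vector<toy_ubatch> out;
    while (!per_seq.empty()) {
        // the shortest remaining sequence and the ubatch budget bound the per-sequence length
        size_t len = n_ubatch / per_seq.size();
        for (const auto & kv : per_seq) len = std::min(len, kv.second.size());
        len = std::max<size_t>(len, 1); // always make progress (may slightly exceed the budget)

        toy_ubatch ub { {}, per_seq.size(), len };
        for (auto it = per_seq.begin(); it != per_seq.end(); ) {
            auto & toks = it->second;
            ub.tokens.insert(ub.tokens.end(), toks.begin(), toks.begin() + len);
            toks.erase(toks.begin(), toks.begin() + len);
            it = toks.empty() ? per_seq.erase(it) : std::next(it);
        }
        out.push_back(std::move(ub));
    }
    return out;
}
```

With this shape, the SSM operations can treat a ubatch as a dense [n_seqs, n_seq_tokens] block, and the recurrent state slots can simply follow the order of the sequences in the ubatch, which is roughly what removes the need for inp_s_seq.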
Performance is not better, but it's also not measurably worse, so that's good.
I've tested that the outputs remain the same for recurrent, transformer, and hybrid models, both with a single sequence, and multiple in parallel. I have also tested embeddings (with bge-small) with this, but not in parallel.
I didn't manage to get rid of GGML_OP_SSM_CONV, because the most straightforward way would quadruple the compute buffer size, scaling up with the batch size.
I think it could be possible to get rid of both GGML_OP_SSM_CONV and GGML_OP_SSM_SCAN, but it would need some kind of looping construct in the compute node graph to keep the number of nodes constant when processing sequences longer than 1 token per ubatch. But this might be a bad idea, especially if it complicates graph dependencies and/or allocation too much.
Will try to focus on this PR in the next few days in order to merge it and try to add Metal GPU support.
@compilade Do you have local changes in this branch? Would like to merge latest master here
@compilade Do you have local changes in this branch? Would like to merge latest `master` here
@ggerganov I do have local changes, which I've pushed now. I was in the process of testing them to see if anything broke. I'm getting slightly different text from https://github.com/ggerganov/llama.cpp/pull/7531/commits/fee3c1d740c0e027c81e2f2f3fb48d619857175f with Jamba-900M, but this might be due to the changes from https://github.com/ggerganov/llama.cpp/pull/7685 regarding special token handling, or not. Jamba doesn't use RoPE, and I've also tried with `cmake .. -DLLAMA_OPENMP=FALSE`. It might be something else; there's no difference at all for Mamba, and I only saw a difference with Jamba-900M with the `main` example with `-p "I believe the meaning of life is" --temp 0 --repeat-penalty 1.2 --repeat-last-n 256` after 20 tokens, so it's likely nothing to worry about.
The change is quite big and I'm having a bit of trouble merging it all at once. Wonder if we should take a more step-by-step approach. The ggml changes alone are good - could these be merged alone and used for the existing mamba-only implementation on master? If yes, then it will allow to start the GPU implementation separately. Maybe after that try to decompose the KV cache changes somehow. Probably after refactoring the code a bit to prepare for this change and splitting the llama.cpp code into more source files.
The change is quite big and I'm having a bit of trouble merging it all at once. Wonder if we should take a more step-by-step approach.
I agree that this is quite big. Sorry about that. I'll see what I can do.
The `ggml` changes alone are good - could these be merged alone and used for the existing mamba-only implementation on master?
Unfortunately, the ggml changes to the mamba-related operators depend on equal sequence length u-batches and contiguous (and ordered) allocation for recurrent states. It might still be possible to extract enough of the new behavior onto the current way recurrent states are managed on master, or not. I'll look into ways to do this.
I think I might be able to separate some parts of this PR.
These are the main separable parts:
- Variable GQA support
  - makes `{arch}.attention.head_count_kv` also capable of being an array of integers (see the sketch after this list)
    - Isn't really used outside of DeciLM and hybrid models. I originally added it to simplify the allocation of the KV cache to reserve space only for the layers that need it in Jamba, and also to identify which layers use Attention and which don't in Jamba. This seemed like a good way to solve three problems at once.
- Advanced batch splits
  - Can be useful on its own, since the buffers it adds eliminate the need for extra allocations in `llama_decode_internal` when `llama_batch_get_one` is used.
- `ggml` improvements to `GGML_OP_SSM_CONV` and `GGML_OP_SSM_SCAN`
  - depends on equal-sequence-length u-batches and contiguous (and ordered) recurrent state slot allocation.
    - There might be a way to retro-fit contiguous allocation on the old way the KV cache was re-used for recurrent states. I'll need to think more about this.
- Separate recurrent state cache from the KV cache
  - This is a big one, since this includes the (maybe over-engineered) recurrent state management which allows keeping state checkpoints and which makes recurrent state slot allocation always contiguous and use the same order the associated `seq_id` have in the batch (which benefits from equal-sequence-length u-batch splitting). This also simplifies how copies of cells between sequences are made, since recurrent state cells can now be shared between `seq_id` with `llama_kv_cache_seq_cp` / `llama_past_seq_cp`, while the latest states are unaliased during slot allocation.
    - Maybe after that try to decompose the KV cache changes somehow. Probably after refactoring the code a bit to prepare for this change and splitting the llama.cpp code into more source files.
    - I'll think about how to make this more easily manageable. But this is inherently a lot of interlinked changes, since the KV cache API has one more type of cache to manage simultaneously (!) for hybrid models, and some operations get their `p0` and/or `p1` ranges modified depending on the presence of state checkpoints.
- Session file support for the separate recurrent state cache
  - This is not yet done
- Jamba support
  - depends on all of the above
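As a hedged illustration of the variable GQA point above (not the conversion or model-loading code from this PR; `toy_layer_kind` and `classify_layers` are made-up names), `{arch}.attention.head_count_kv` can be read as either a single value or one value per layer, and a layer with 0 KV heads is treated as a recurrent layer:

```cpp
// Illustrative only. Assumes head_count_kv has either 1 entry (broadcast to
// every layer) or exactly n_layer entries.
#include <cstdint>
#include <vector>

enum class toy_layer_kind { attention, recurrent };

static std::vector<toy_layer_kind> classify_layers(
        const std::vector<uint32_t> & head_count_kv, uint32_t n_layer) {
    std::vector<toy_layer_kind> kinds(n_layer);
    for (uint32_t il = 0; il < n_layer; ++il) {
        const uint32_t n_head_kv = head_count_kv.size() == 1 ? head_count_kv[0]
                                                             : head_count_kv[il];
        // a layer with 0 KV heads has no KV cache: it is a recurrent (Mamba) layer
        kinds[il] = n_head_kv == 0 ? toy_layer_kind::recurrent : toy_layer_kind::attention;
    }
    return kinds;
}
```

For Jamba, the layers classified as recurrent here are the Mamba layers, and they reserve no space in the KV cache.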
Variable GQA support
Could we extend this point a bit more and add support for OpenELM together with it? The PR for OpenELM is almost ready, but has some quick hacks that seem relevant to this: https://github.com/ggerganov/llama.cpp/pull/7359
Now that variable GQA support is in master (since #7359 has been merged), I plan to separate the advanced batch splits feature into its own PR for easier review.
(for some context, this allows splitting batches as described in https://github.com/ggerganov/llama.cpp/pull/7531#discussion_r1620997020, and also single-sequence ubatches, as well as the current simple split used on master)
Any updates on this since Jamba 1.5 is now out?
Any updates on this since Jamba 1.5 is now out?
@Autumnlight02
Basically, since https://github.com/ggerganov/llama.cpp/pull/8526 was merged, now I need to resolve a very big merge conflict because I didn't keep the code identical. This will probably take a few days.
Some progress update on Jamba:
I began resolving the merge conflicts, and there were at least 2000 lines of conflicts (basically half of this PR). This is manageable.
While I've solved most of them, the result is not usable yet: it doesn't build, so I did not push it here (sorry); I will push once it works. The cause is the state saving and restoring code, which was changed in #8699 and doesn't yet handle two caches.
My problem right now is with the single-sequence session restoring, which uses llama_kv_cache_find_slot in master, ~~but this won't really work for how transparent recurrent state checkpoints are implemented here~~, so I'm thinking of other ways.
(EDIT: on further thought llama_kv_cache_find_slot can work, but only for a single checkpoint per sequence. This might be sufficient. I'm still leaving the rest of this comment intact because it's still somewhat relevant to know the tradeoffs of the implementation of recurrent state checkpoints)
To make single-sequence session restores simpler, I could either
- Keep using `llama_kv_cache_find_slot` for that because it turns out it's not a problem
  - Only realized this after writing this whole comment.
  - Would only work to restore a single state checkpoint per sequence.
- Throw away state checkpoints and postpone them for a future PR.
  - This would simplify everything, but would result in a bad user experience with recurrent and hybrid models due to excessive prompt reprocessing when using `llama-server` for conversations, because recurrent states can't be rolled back (yet?), and prompt processing has to start back from the beginning when the server removes more than one token at the end (extremely common).
    - This is also currently the situation for purely recurrent models on `master` (might not be that bad?)
- Make `llama_kv_cache_defrag` defragment the whole KV cache to get an easy contiguous slot at the "end"
  - Requires deeply refactoring kv cache defrag to use `ggml_get_rows` instead of potentially thousands of individual tensor copies (otherwise defragmenting the whole KV cache won't really be doable in one shot)
- Store fragmented cache
  - Might be more complicated (and sometimes less efficient) than defragmenting beforehand.
- Explicitly fail for single-sequence session restore for recurrent (and hybrid) models
  - Regression: on `master`, this currently works for recurrent models since #8526
The least bad option (EDIT: apart from simply using llama_kv_cache_find_slot) seems to be to improve KV cache defragmentation, and again it seems like it could be its own PR. I'll begin working on that.
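For what it's worth, the `ggml_get_rows` idea from the defragmentation option above could look roughly like the sketch below. This is an assumption about shape and naming, not the planned implementation: the `defrag_gather_k` helper and the [n_embd_k, n_kv_cells] layout are made up for illustration.

```cpp
#include "ggml.h"

// k_layer:    a [n_embd_k, n_kv_cells] F32 view of one layer's K buffer,
//             one row per KV cell (layout assumed for this sketch)
// dst_to_src: an I32 tensor of length n_used_cells; entry i is the source cell
//             whose row should end up at destination row i
static struct ggml_tensor * defrag_gather_k(
        struct ggml_context * ctx,
        struct ggml_tensor  * k_layer,
        struct ggml_tensor  * dst_to_src) {
    // a single gather node per layer replaces potentially thousands of individual copy nodes
    return ggml_get_rows(ctx, k_layer, dst_to_src);
}
```

The gathered result would still need to be written back into the cache buffer (and the same done for V and for the cell metadata), which is presumably where most of the actual refactoring effort lies.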
But I'm starting to think that maybe state checkpoints add too much complexity.
The current implementation uses a unified pool of recurrent state cells to allocate checkpoints and/or current states for each seq_id while ensuring the available cells are allocated fairly to each "used" seq_id. If there are only 2 "used" seq_id but there are 8 allocated cells, then each seq_id will get 4 cells (1 for the "tail" cell, and 3 for the checkpoints). If there's a third seq_id appearing, then they will each get at least 2 cells, while some of them will use the remaining 2 cells. That behavior requires keeping track of a lot of things including the relationship of cells in a tree of sequences. (some cells can be common between sequences, and the count of shared cells is managed differently; the explanation in this paragraph glosses over some details)
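Restating the arithmetic from the previous paragraph as a toy example (the `rs_budget` and `rs_fair_share` names are illustrative, not fields or functions from this PR):

```cpp
#include <cstdio>

struct rs_budget {
    unsigned min_cells_per_seq; // guaranteed cells per sequence (1 tail + checkpoints)
    unsigned seqs_with_extra;   // how many sequences may hold one extra cell
};

static rs_budget rs_fair_share(unsigned n_cells, unsigned n_used_seq) {
    rs_budget b = {0, 0};
    if (n_used_seq == 0) return b;
    b.min_cells_per_seq = n_cells / n_used_seq;
    b.seqs_with_extra   = n_cells % n_used_seq;
    return b;
}

int main() {
    // 8 cells, 2 sequences -> 4 cells each (1 tail + 3 checkpoints)
    // 8 cells, 3 sequences -> at least 2 cells each, and 2 sequences get a 3rd
    for (unsigned n_seq : {2u, 3u}) {
        const rs_budget b = rs_fair_share(8, n_seq);
        std::printf("n_seq=%u: min per seq = %u, with extra = %u\n",
                    n_seq, b.min_cells_per_seq, b.seqs_with_extra);
    }
    return 0;
}
```

This matches the 8-cell examples above; the hard part described in the paragraph is not this arithmetic but keeping the tree of shared cells consistent as sequences appear, diverge, and disappear.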
Some alternatives:
- Manually use one more `seq_id` (with `llama_kv_cache_seq_cp`) at each "checkpoint"
  - Internally simpler
  - Potentially better checkpoint placement
  - Harder to manage in the examples (and for 3rd party apps using the KV cache API) than automatic checkpoints
  - When using `llama_kv_cache_seq_rm` with a partial token range instead of with a whole sequence, the same problems as before apply: with recurrent models, processing the prompt would need to start over from the beginning, and would not detect the most appropriate checkpoint to use (if there was one).
    - Although with automatic checkpoints, this also has to be handled, but "starting over" happens closer to the end of the prompt.
  - super-sequences have to be tracked manually
  - `seq_id` re-use can be complicated
    - slot-ids in `llama-server` would no longer directly map to seq_ids
  - Has to be explicitly managed anywhere it could be useful
    - simpler on the inside but more complicated on the outside
  - Not sure if `-np` should still also be the number of distinct recurrent states, because it's also the slot count in the server.
- Calculate states in reverse
  - best memory usage
  - would be very cool
  - not sure it's possible
  - not a general solution for all recurrent models
  - needs further research for each recurrent architecture
  - would need keeping track of tokens in the cache
- Pre-allocate a fixed number of recurrent state checkpoints for each sequence
  - Simpler, but not really (contiguous slot allocation could make this more complicated)
  - (considering a minimum of 3 checkpoints per sequence is necessary to properly benefit from checkpoints)
  - Not ideal memory usage
    - Especially with the dedicated sequence for the system prompt in `llama-server` and `llama-parallel`
  - Would need some way of specifying the number of checkpoints per sequence (can't be `-c` or `-np`, for reasons)
Manual checkpoint management seems tempting, but would offload the complexity to llama-server, which might not be desirable (since in the end it's mostly the same things which need to be tracked).
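For reference, the manual-checkpoint option discussed above could look roughly like this from user code. It is a sketch under the assumption that the `llama_kv_cache_seq_rm`/`llama_kv_cache_seq_cp` semantics described in this PR apply (copies of recurrent states are shared until they diverge); the `checkpoint_*` helpers are hypothetical.

```cpp
#include "llama.h"

// Snapshot the working sequence into a spare seq_id reserved as a checkpoint.
static void checkpoint_make(llama_context * ctx, llama_seq_id work, llama_seq_id ckpt) {
    llama_kv_cache_seq_rm(ctx, ckpt, -1, -1);        // drop any older snapshot
    llama_kv_cache_seq_cp(ctx, work, ckpt, -1, -1);  // snapshot the whole sequence
}

// Roll the working sequence back to the snapshot (e.g. to trim a stop string).
static void checkpoint_restore(llama_context * ctx, llama_seq_id work, llama_seq_id ckpt) {
    llama_kv_cache_seq_rm(ctx, work, -1, -1);        // discard everything generated since the snapshot
    llama_kv_cache_seq_cp(ctx, ckpt, work, -1, -1);  // copy the snapshot back into the working sequence
}
```

The caller would also have to remember which positions the snapshot covers in order to know where to resume decoding after a restore, which is exactly the kind of bookkeeping that would be offloaded to llama-server.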
Meanwhile I will attempt to refactor KV cache defragmentation soon (which should be useful anyway).
Regarding the manual checkpoint management - recently, the commonly used APIs in the cloud (e.g. Anthropic, OpenAI, etc) introduced "prompt caching" [0], which adds a "cache control" parameter to the requests. It can be used to cache prompts, but I guess it fits well with the idea of manual recurrent state checkpointing from the user code.
I'm thinking that the changes for Jamba should be kept to a minimum for now, even if this would require longer processing times for common use cases. The reason is that the architecture is not yet well adopted, so increasing the complexity of the codebase to support it is not very justified. The better approach would be to improve the support for the existing transformer and mamba arches, by refactoring the KV cache and state management implementation and adding unit tests. I suppose a large part of the complexity with Jamba comes from the fact that we are trying to fit the logic into the existing KV cache implementation, which is not well-suited to this architecture.
[0] - https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
For the first time state saving and reloading works for Jamba (both for the whole state and single-sequences). 🎉
This is implemented in https://github.com/ggerganov/llama.cpp/pull/7531/commits/fcb889cf7fb6588a6565f4cc6373be3f53ff25ca
I'm thinking that the changes for Jamba should be kept to a minimum for now, even if this would require longer processing times for common use cases. The reason is that the architecture is not yet well adopted, so increasing the complexity of the codebase to support it is not very justified.
Agreed. I'll start simplifying the code and will think about how to best approach manual/explicit checkpoints for a future PR. The implicit checkpoints implemented here are a bit over-engineered, and do not fit in the idea of a "minimal" change.
I suppose a large part of the complexity with Jamba comes from the fact that we are trying to fit the logic into the existing KV cache implementation, which is not well-suited to this architecture.
I did not find the existing KV cache implementation to be particularly limiting. Most of the complexity in the Jamba implementation here comes from the allocation of recurrent states and implicit checkpoints. The only necessary complexity needed for Jamba is that both the KV cache and the recurrent state cache should be kept in sync, and even then most of the complexity is in keeping the metadata of the tree of sequences consistent (some of which is only there to allow fairly allocating the cache between seq_ids).
My plan for this PR in the next days/weeks:
- Remove implicit recurrent state checkpoints to remove unnecessary complexity (I estimate this will reduce the change by 1000+ lines)
  - Proper explicit checkpoint handling (likely with some new API to make it more convenient) is postponed to a future pull-request.
    - I'll still keep the code for implicit checkpoints somewhere because it's possible to make it explicit instead with only a few lines changed
    - I want to explore simpler ways to do this.
    - The advantage of explicit checkpoints is that the minimum useful number of states per user is 2 instead of 3, because `llama-server` "knows" when the stop token (or the beginning of the stop string!) is sampled.
- Rename `llama_past` back to `llama_kv_cache` (which will still contain both Attention's KV cache and the recurrent state cache)
  - `llama_kv_cache` (which currently contains only Attention's KV cache) will be renamed to `llama_kv_self_cache`, for self-attention, because even in T5 it's only used for self-attention
  - `llama_rs_cache` (which contains recurrent states) will be renamed to either
    - `llama_kv_rect_cache` (stands for "RECurrenT")
    - `llama_kv_rest_cache` (stands for "REcurrent STate", but might be confusing)
    - `llama_kv_iter_cache` (might be confused with Iterators in the C++ language even if unrelated)
    - `llama_rs_cache` (same name, no renaming)
    - `llama_rs_self_cache` (for consistency with `kv_self`? `self` has no particular meaning here)
      - Probably what I'll pick
  - `n_rs` and `rs_head` (vs `n_kv` and `kv_head`) will keep their names.
  - This will avoid changing the names of the functions of the KV cache API and will let us still call it "the KV cache API"
What will be left intact is:
- Jamba support
- The KV cache API which also simultaneously manages recurrent states
  - Same drawbacks as for Mamba and RWKV-v6 on `master`, i.e. only one state per sequence (no rollback).
- Session saving and reloading for hybrid models
As before, hybrid models of different architectures should be able to work on top of that (like how RWKV-v6 and Mamba can share the same recurrent state management code), as long as it's about hybrids between Attention and some recurrent block. This will mean models like Zamba (Mamba + Attention), Zamba2 (Mamba-2 + Attention), RecurrentGemma (RG-LRU + Attention), and others should be easier to implement without worrying about the KV cache API too much, and they will benefit from future improvements in state checkpoint management.
(Note that mixing different recurrent architectures in the same model is out of scope, but I don't think this will be a problem)
How's this going?
Progress?
@theogbob, you may tag the author, @compilade, when asking about progress. :)
Progress @compilade ?
@compilade Thanks for all the work here! I've also been working through a very similar architecture for bamba independently. Bamba is essentially the same as Jamba, but with mamba2 layers instead of mamba layers.
I suspect your implementation of llama_rs_cache is a much better approach than the one I took of simply creating a duplicate llama_kv_cache and conditionally making the two caches have zero-sized layers. I've also based my branch on your mamba2 work, so I'd be really interested in consolidating these threads and helping where possible with your work to support hybrid-recurrent models (we are really interested in these architectures at IBM).
It looks like this branch is pretty out of date with the latest refactors in the codebase. I have a version of my branch that I got working against the rebased tip of your mamba2 branch (BambaArchitectureRefactor), but it looks like it's out-of-date again based on further changes in the KV caching interface, and similarly it looks like the mamba2 branch is somewhat out of date at this point.
We just released an updated V2 of bamba, so I'd love to push forward with the architecture. If there's interest, I'd be happy to try to rebase this branch on the tip of master with all other refactors. I'm a lot less familiar with the kernel-level optimizations for mamba2, but could look at resolving conflicts there too.
I suspect your implementation of `llama_rs_cache` is a much better approach than the one I took of simply creating a duplicate `llama_kv_cache` and conditionally making the two caches have zero-sized layers.
@gabe-l-hart
Interestingly, this sounds very similar to what I've done here. llama_rs_cache and llama_kv_cache have some mutually-exclusive zero-sized layers in Jamba.
Another approach like using per-layer cache types would need considerable additional refactoring which would conflict even more with https://github.com/ggml-org/llama.cpp/pull/12799 (although it might simplify https://github.com/ggml-org/llama.cpp/pull/13194).
(Now that I write this out, you're making me realize that all the kv-cache needs for hybrid models is per-type (e.g. self-attention and recurrent) top-level metadata (the cells) and some data buffers (of which there seem to always be up to 2 per layer (k and v, or r and s), since no layer ever has both Attention and recurrent states (at least this seems true for the hybrid models I've seen so far)). That is pretty much what is implemented here with the zero-sized layers, but this hints towards possible future simplifications (which will be doable after resolving conflicts from https://github.com/ggml-org/llama.cpp/pull/12799 here).)
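To make that parenthetical a bit more concrete, a toy sketch of the structural idea could look like this (purely illustrative; none of these types exist in llama.cpp):

```cpp
// The point is the shape of the data: per-type cell metadata tracked once, and
// per-layer buffers where the unused kind stays empty (zero-sized) depending on
// whether the layer is an Attention layer or a recurrent layer.
#include <cstdint>
#include <vector>

struct toy_layer_buffers {
    std::vector<float> k, v; // non-empty only for Attention layers
    std::vector<float> r, s; // non-empty only for recurrent layers (conv state, ssm state)
};

struct toy_hybrid_cache {
    // per-type top-level metadata (the "cells"), shared by all layers of that type
    std::vector<int32_t> kv_cell_pos;    // position stored in each self-attention cell
    std::vector<int32_t> rs_cell_seq_id; // owner sequence of each recurrent state slot
    // per-layer data: for any given layer only one of the (k, v) / (r, s) pairs is used
    std::vector<toy_layer_buffers> layers;
};
```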
I've also based my branch on your `mamba2` work, so I'd be really interested in consolidating these threads and helping where possible with your work to support hybrid-recurrent models (we are really interested in these architectures at IBM).
I too would be interested in consolidating with your work, or at least making it easier for you to get Bamba supported. How would you prefer this to happen?
Note that I will update the mamba2 branch to keep up with the latest changes, and it may or may not result in lots of conflicts in your branch. I'm not sure if that's avoidable. Hopefully it's not too bad.
It looks like this branch is pretty out of date with the latest refactors in the codebase. I have a version of my branch that I got working against the rebased tip of your `mamba2` branch (BambaArchitectureRefactor), but it looks like it's out-of-date again based on further changes in the KV caching interface, and similarly it looks like the `mamba2` branch is somewhat out of date at this point.
Yes, this branch is not very up to date, but it's fixable. The main reason close to no progress was being made here is that I don't find it particularly fun to resolve thousands of lines of conflicts. Or at least, I need to dedicate a good chunk of time to it so that I don't get lost half-way (since conflict resolutions of this size mostly mean re-thinking the approach and porting it to the new structures).
So this PR might have been neglected for a while because the moments when I had enough time and the moments when I wanted to fix this and/or reply[^1] to "progress?" comments did not align.
[^1]: This comment did take me more than 3 hours to write. I should probably write smaller comments.
But I am in a period where I'm starting to have more spare time, and so I could dedicate a day (or more) to resolve the conflicts here and in #9126 (but I suspect it's going to take more than a day).
> We just released an updated V2 of `bamba`, so I'd love to push forward with the architecture.
That's awesome!
> If there's interest, I'd be happy to try to rebase this branch on the tip of `master` with all other refactors.
When branches drift that much, merging is usually simpler to handle than rebasing: it still leaves a trail of tested versions, and it allows resolving conflicts once (per merge) instead of at every commit that changes conflicting parts.
But I see what you mean, and I'd love to get help with the conflict resolution; unfortunately it's something that almost has to be done in one go (because git doesn't have first-class conflicts), so collaborating on that aspect isn't particularly straightforward.
> I'm a lot less familiar with the kernel-level optimizations for `mamba2`, but could look at resolving conflicts there too.
Right, the Mamba2 branch (in #9126) slightly changes how the SSM operator works (to minimize useless copies), and that will need to be adapted to the CUDA version of the operator, which was added in https://github.com/ggml-org/llama.cpp/pull/10558.
Thank you for the detailed response! It's really helpful. I 100% hear you on the giant merge conflicts, and I agree at this stage merging is better than rebasing.
I spent yesterday trying to resolve mamba2 with the latest master (https://github.com/gabe-l-hart/llama.cpp/tree/BambaAbstractMemory). It's not actually working yet, so I clearly missed something. I'll take another whack at it today and see how far I can get it. If I can get mamba2 working by itself, I may try to push on the hybrid architecture more.
It looks like the biggest change since I last synced is around moving to more abstract interfaces for things. In particular, it looks like all caching has moved behind the memory interface, though it gets liberally cast back to the unified cache type. This makes me think the intent is to move closer to how this is done in transformers where individual models can define their own cache semantics, but I'm not totally clear here yet. I'll post useful findings as I go unless you end up getting deep into it and making a lot of progress.
As always, thanks for the outstanding work here, 3-hour comments included!
Ok, I found my merge bugs in https://github.com/gabe-l-hart/llama.cpp/tree/BambaAbstractMemory and I'm now able to run a lightweight mamba2 model (details below).
As a separate question, this probably isn't the right place to centralize this discussion. Would it be best to create a central issue to discuss the convergence of mamba2, jamba, and bamba?
Details
# Download lightweight mamba2 model
huggingface-cli download AntonV/mamba2-370m-hf --local-dir ~/models/mamba2-370m-hf
# Convert to GGUF
python convert_hf_to_gguf.py ~/models/mamba2-370m-hf/
# Run a sample query
./build/bin/llama-cli -m ~/models/mamba2-370m-hf/mamba2-370M-hf-F16.gguf -p "Hello world" -ngl 0 --temp 0 -n 20
> I'm now able to run a lightweight mamba2 model (details below).
@gabe-l-hart Amazing!
I've also merged from latest master (into https://github.com/ggml-org/llama.cpp/pull/9126), and some parts differ, but most is similar or the same.
It's very helpful to compare the two merges and see where the approaches differ[^1] (and sometimes to notice when changes are missing). It does reduce the stress of a bad merge. Thank you!
[^1]: with git diff 611a470fc1e25e7388c71734f09852a5d9c6ed06 6def5cd729fdde64b2addeaa5cce016c72485e06
(although it seems like git log --remerge-diff doesn't work on your merge; was it a squash merge perhaps?)
Multi-sequence inference is broken, though (that's also true on master with plain Mamba and RWKV). To test this, you can use:
$ ./build/bin/llama-parallel -m ~/models/mamba2-370m-hf/mamba2-370M-hf-F16.gguf -np 5 -ns 8 --temp 0 --repeat-penalty 1.1
Part of the problem is caused by an early return true in seq_rm, but there's another problem: the states don't seem to be properly isolated between sequences (which also appears to be the case on master). I'll try to find a fix. I suspect it might be due to modifying const_cast-ed values, but it might be something else.
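To illustrate why that early `return true` matters, here is a toy sketch (hypothetical names, not the actual llama.cpp code) of a per-sequence recurrent cache: there is only one rolled-up state per sequence, so a partial removal can't be honored without an older checkpoint, and claiming success anyway leaves a stale state behind.

```cpp
// Toy illustration (not llama.cpp code): a recurrent cache keeps a single
// rolled-up state per sequence, so a partial removal cannot be honored.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct toy_recurrent_state {
    int32_t n_past = 0;       // number of tokens already folded into the state
    std::vector<float> state; // the single latest state for this sequence
};

struct toy_recurrent_cache {
    std::unordered_map<int32_t, toy_recurrent_state> seqs; // states isolated per seq_id

    // Returns false when the removal cannot be represented, so the caller
    // knows it has to reprocess tokens instead of assuming success.
    bool seq_rm(int32_t seq_id, int32_t p0, int32_t p1) {
        auto it = seqs.find(seq_id);
        if (it == seqs.end()) {
            return true; // nothing to remove for this sequence
        }
        // Only a whole-sequence removal can be represented without checkpoints
        // (negative p0/p1 mean "from the start" / "to the end" in this toy).
        const bool whole = (p0 <= 0) && (p1 < 0 || p1 >= it->second.n_past);
        if (!whole) {
            // Returning true here anyway would silently keep a state that
            // still includes the "removed" tokens, corrupting later decoding.
            return false;
        }
        seqs.erase(it);
        return true;
    }
};
```

With state checkpoints, an older state can be restored instead, which is what makes partial rollback practical for recurrent layers.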
> As a separate question, this probably isn't the right place to centralize this discussion. Would it be best to create a central issue to discuss the convergence of `mamba2`, `jamba`, and `bamba`?
Yes, I think that would be more appropriate. It's true that, technically, Mamba2 isn't directly related to Jamba. If I create the issue, I will tag you and refer to the relevant PRs and issues.
@compilade Great to hear that you got the merge working, and not at all surprised that I missed some nuance beyond basic single-sequence generation. I'll look to pick up your changes on my branch.
> (although it seems like git log --remerge-diff doesn't work on your merge; was it a squash merge perhaps?)
I've never used --remerge-diff! I love learning new tricks. I did not do anything with squashing intentionally, but I did amend the merge commit a couple of times, so maybe that did it?
I did also start taking a whack at the hybrid cache based on the new layers of abstraction in llama-memory and llama-kv-cache. It's in a broken state, so nothing is pushed yet, but the approach I'm taking is to move everything in llama-context to use the llama_kv_cache abstract interface and then liberally hoist methods from llama_kv_cache_unified up into the abstract method set of llama_kv_cache. This would then allow llama_kv_cache_hybrid to implement them by dispatching to the appropriate cache by layer.
The trickiest part seems to be the intermixing of kv_self_update in llama_context, which currently needs intimate details of the member data from llama_kv_cache_unified. I tried moving all of that over into the kv cache class hierarchy, but it also needs intimate knowledge of graph creation and execution, which seems to be correctly siloed in llama-context. I'll keep digging tomorrow!
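To make the dispatch-by-layer idea above concrete, here is a rough sketch under the assumption of a minimal abstract cache interface (all names are hypothetical, not the current llama.cpp classes):

```cpp
// Rough sketch of the layer-dispatch idea (hypothetical names, not the
// current llama.cpp interfaces).
#include <cstdint>
#include <memory>
#include <vector>

struct toy_cache {
    virtual ~toy_cache() = default;
    virtual bool seq_rm(int32_t seq_id, int32_t p0, int32_t p1) = 0;
};

// A hybrid cache owns one child cache per kind (Attention or recurrent) and a
// per-layer table saying which child is responsible for which layer.
struct toy_cache_hybrid : toy_cache {
    std::unique_ptr<toy_cache> attn;
    std::unique_ptr<toy_cache> recr;
    std::vector<bool> layer_is_recurrent; // one flag per model layer

    // Per-layer operations (e.g. fetching a layer's tensors) pick the child
    // that owns the layer.
    toy_cache * cache_for_layer(size_t il) const {
        return layer_is_recurrent[il] ? recr.get() : attn.get();
    }

    // Whole-cache operations are forwarded to both children; if either cannot
    // represent the change, the caller is told so and can reprocess tokens.
    bool seq_rm(int32_t seq_id, int32_t p0, int32_t p1) override {
        const bool ok_attn = attn ? attn->seq_rm(seq_id, p0, p1) : true;
        const bool ok_recr = recr ? recr->seq_rm(seq_id, p0, p1) : true;
        return ok_attn && ok_recr;
    }
};
```

In this shape, sequence-level bookkeeping stays in the children, and the hybrid class is mostly routing plus agreement on whether an operation could be honored.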
It looks like the work of hoisting the cache abstraction is almost all done in https://github.com/ggml-org/llama.cpp/pull/12799! I'll move to build off of that branch.