graph : reuse SSM graphs
Not sure if there is a reason not to enable graph reuse for recurrent graphs (mamba, hybrids, SSM, etc.). I did a few tests and it seems to work, resulting in some modest perf improvements. cc @gabe-l-hart @compilade
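For reference, a minimal sketch of the reuse path that `LLAMA_GRAPH_REUSE_DISABLE` toggles, assuming a simplified view of `llama_context::process_ubatch`; the `graph_params`/`graph_result` types below are stand-ins, not the actual llama.cpp classes:

```cpp
#include <cstdlib>
#include <memory>

// Stand-in types: the real ones live in llama-graph.h / llama-context.h.
struct graph_params { /* ubatch shape, memory-context state, ... */ };

struct graph_result {
    // true when the previously built graph can serve the new params
    bool can_reuse(const graph_params & /*params*/) const { return true; }
};

static std::unique_ptr<graph_result> build_graph(const graph_params &) {
    return std::make_unique<graph_result>();
}

static std::unique_ptr<graph_result> gf_prev;

graph_result * process_ubatch(const graph_params & params) {
    // presumably read once at startup in the real code
    const bool reuse_disabled = std::getenv("LLAMA_GRAPH_REUSE_DISABLE") != nullptr;

    if (!reuse_disabled && gf_prev && gf_prev->can_reuse(params)) {
        return gf_prev.get();      // reuse: only the input tensors get refreshed
    }

    gf_prev = build_graph(params); // otherwise rebuild the graph from scratch
    return gf_prev.get();
}
```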
Without graph reuse
make -j && LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-bench -m ../models/mamba-130m/ggml-model-f16.gguf -m ../models/granite-4-h-tiny/ggml-model-q8_0.gguf -m ../models/ai21-jamba-mini-1.7/ggml-model-q8_0.gguf -m ../models/liquidai-lfm2-2.6b/ggml-model-q4_k.gguf -fa 1 -t 1 -n 32
| model | size | params | backend | ngl | threads | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | pp512 | 8415.73 ± 46.47 |
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | tg32 | 322.74 ± 0.64 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | pp512 | 2119.36 ± 3.31 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | tg32 | 77.17 ± 0.11 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | pp512 | 603.47 ± 1.83 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | tg32 | 42.35 ± 0.02 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | pp512 | 2923.41 ± 3.20 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | tg32 | 169.83 ± 0.67 |
| build: 638e2c239 (6725) |
With graph reuse
make -j && ./bin/llama-bench -m ../models/mamba-130m/ggml-model-f16.gguf -m ../models/granite-4-h-tiny/ggml-model-q8_0.gguf -m ../models/ai21-jamba-mini-1.7/ggml-model-q8_0.gguf -m ../models/liquidai-lfm2-2.6b/ggml-model-q4_k.gguf -fa 1 -t 1 -n 32
| model | size | params | backend | ngl | threads | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | pp512 | 8453.65 ± 20.10 |
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | tg32 | 348.83 ± 1.67 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | pp512 | 2126.12 ± 1.90 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | tg32 | 82.26 ± 0.13 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | pp512 | 604.56 ± 2.08 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | tg32 | 43.22 ± 0.02 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | pp512 | 2928.31 ± 1.78 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | tg32 | 179.18 ± 0.47 |
| build: 638e2c239 (6725) |
Very cool! I'll test shortly with Granite 4.
The only thought I've had about why this might be difficult is around implementing the SSD version of SSM_SCAN. In mamba_ssm and mlx, they conditionally use SSD if (and only if) the cache is empty and the sequence length is > 1. Since SSD is composed of a bunch of smaller ops (tril and cumsum), one way it could be implemented is at the graph-building layer, which would result in different graphs for different parts of the generate loop. That said, it could also be implemented inside the SSM_SCAN kernel-dispatching layer, so I don't think it's a blocker for reusing graphs.
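As a sketch of that dispatch rule (a hypothetical helper based only on the condition described above, not llama.cpp code):

```cpp
// Hypothetical helper: choose the SSD (chunked, attention-like) formulation
// only when there is no prior recurrent state and more than one token is
// processed per sequence; otherwise fall back to the sequential scan.
// Deciding this at graph-build time yields different graphs for prefill vs.
// decode; deciding it inside the kernel dispatch keeps the graph topology
// fixed, which is friendlier to graph reuse.
static bool use_ssd_path(bool cache_empty, int n_seq_tokens) {
    return cache_empty && n_seq_tokens > 1;
}
```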
@gabe-l-hart What is SSD?
> What is SSD?
Sorry, commenting from my phone at the airport! SSD is the State Space Duality part of the mamba2 paper where they reframe the SSM_SCAN op as an attention operation. The mlx implementation is here and the original triton kernel is here. I'm still working on actually grokking the math and was hoping to try to get it implemented in ggml soon-ish. It should provide a nice performance boost for prefill.
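For reference, a sketch of the duality in the scalar-decay case from the Mamba-2 paper (notation follows the paper, not llama.cpp):

```latex
% Recurrence form of the selective SSM:
\[
h_t = A_t\, h_{t-1} + B_t\, x_t, \qquad y_t = C_t^{\top} h_t .
\]
% Unrolled over a chunk with an empty initial state, the same computation
% becomes a matrix multiply with a lower-triangular, attention-like matrix:
\[
y_j = \sum_{i \le j} C_j^{\top} \Bigl( \prod_{k=i+1}^{j} A_k \Bigr) B_i\, x_i
\quad\Longleftrightarrow\quad
y = M x, \qquad
M_{ji} = C_j^{\top} \Bigl( \prod_{k=i+1}^{j} A_k \Bigr) B_i \quad (j \ge i),
\]
% which is where the tril mask and the cumulative sum of \(\log A_k\)
% (used to evaluate the products stably) come in.
```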
Results looking good for granite4:micro-h (using the GGUF we uploaded to Ollama):
Metal
Reuse on, fa on
./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa on
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 128 | 128 | 1 | 256 | 0.166 | 770.58 | 1.617 | 79.16 | 1.783 | 143.56 |
| 128 | 128 | 2 | 512 | 0.309 | 828.03 | 4.041 | 63.35 | 4.350 | 117.69 |
| 128 | 128 | 4 | 1024 | 0.598 | 856.59 | 7.172 | 71.39 | 7.770 | 131.80 |
| 256 | 128 | 1 | 384 | 0.307 | 834.22 | 1.628 | 78.63 | 1.935 | 198.48 |
| 256 | 128 | 2 | 768 | 0.593 | 863.30 | 4.048 | 63.24 | 4.641 | 165.48 |
| 256 | 128 | 4 | 1536 | 1.191 | 860.14 | 7.162 | 71.48 | 8.353 | 183.89 |
Reuse on, fa off
./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa off
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 128 | 128 | 1 | 256 | 0.167 | 768.56 | 1.697 | 75.43 | 1.863 | 137.38 |
| 128 | 128 | 2 | 512 | 0.310 | 826.56 | 4.130 | 61.99 | 4.440 | 115.32 |
| 128 | 128 | 4 | 1024 | 0.599 | 854.75 | 7.232 | 70.80 | 7.831 | 130.76 |
| 256 | 128 | 1 | 384 | 0.307 | 833.12 | 1.705 | 75.08 | 2.012 | 190.84 |
| 256 | 128 | 2 | 768 | 0.594 | 861.28 | 4.175 | 61.32 | 4.770 | 161.02 |
| 256 | 128 | 4 | 1536 | 1.193 | 858.02 | 7.237 | 70.75 | 8.430 | 182.20 |
Reuse off, fa on
LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa on
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 128 | 128 | 1 | 256 | 0.166 | 769.00 | 1.714 | 74.69 | 1.880 | 136.16 |
| 128 | 128 | 2 | 512 | 0.309 | 828.93 | 4.209 | 60.83 | 4.517 | 113.34 |
| 128 | 128 | 4 | 1024 | 0.598 | 855.49 | 7.282 | 70.31 | 7.881 | 129.94 |
| 256 | 128 | 1 | 384 | 0.307 | 834.57 | 1.763 | 72.61 | 2.070 | 185.55 |
| 256 | 128 | 2 | 768 | 0.593 | 864.13 | 4.176 | 61.30 | 4.769 | 161.04 |
| 256 | 128 | 4 | 1536 | 1.190 | 860.30 | 7.291 | 70.22 | 8.481 | 181.10 |
Reuse off, fa off
LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa off
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 128 | 128 | 1 | 256 | 0.170 | 751.10 | 1.790 | 71.50 | 1.961 | 130.58 |
| 128 | 128 | 2 | 512 | 0.310 | 826.98 | 4.190 | 61.10 | 4.499 | 113.79 |
| 128 | 128 | 4 | 1024 | 0.602 | 850.25 | 7.299 | 70.15 | 7.901 | 129.60 |
| 256 | 128 | 1 | 384 | 0.309 | 829.05 | 1.793 | 71.38 | 2.102 | 182.67 |
| 256 | 128 | 2 | 768 | 0.596 | 858.70 | 4.220 | 60.67 | 4.816 | 159.46 |
| 256 | 128 | 4 | 1536 | 1.193 | 858.16 | 7.309 | 70.05 | 8.502 | 180.66 |
@gabe-l-hart as a side note, I've added cumsum and tri as new ops during the Qwen3Next implementation, so that might allow for some decoupling.
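For illustration, a generic sketch (plain C++ on `std::vector`, not the new ggml ops themselves) of how a cumulative sum plus a lower-triangular mask compose into the "segsum" term used by SSD, i.e. the sum of log A_k over k = i+1..j for j >= i:

```cpp
#include <limits>
#include <vector>

// Generic illustration only: build the segment-sum matrix
// segsum[j][i] = log A_{i+1} + ... + log A_j for j >= i (and -inf below the
// mask) from a cumulative sum and a tril mask.
std::vector<std::vector<float>> segsum(const std::vector<float> & log_a) {
    const size_t n = log_a.size();

    std::vector<float> cs(n + 1, 0.0f);                  // exclusive cumulative sum
    for (size_t t = 0; t < n; ++t) {
        cs[t + 1] = cs[t] + log_a[t];
    }

    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<std::vector<float>> out(n, std::vector<float>(n, neg_inf));
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i <= j; ++i) {                // tril: keep j >= i only
            out[j][i] = cs[j + 1] - cs[i + 1];           // sum of log_a[i+1..j]
        }
    }
    return out;
}
```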
> I've added cumsum and tri as new ops during the Qwen3Next implementation, so that might allow for some decoupling.
@pwilkin I thought I saw that while trying to keep up with the comments! It's high on my todo list after this conference to get into your PR (partly selfishly, because I want to reuse these parts).
@gabe-l-hart Parallel performance of SSMs should be fixed with #16494
Thank you for digging into these performance improvements!
I'm hitting errors on metal with the most recent changes on this branch:
lldb ./bin/llama-cli -- -m $(find-ollama-gguf.sh granite4:micro-h) -no-cnv -p "tell me a story about a developer and their dog?" -ngl 99 --temp 0
tell me a story about a developer and their dog? The response must
Process 95451 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x3c)
frame #0: 0x00000001016f7e00 libllama.dylib`llama_memory_recurrent_context::get_head(this=0x0000000000000000) const at llama-memory-recurrent.cpp:1144:12
1141 }
1142
1143 uint32_t llama_memory_recurrent_context::get_head() const {
-> 1144 return head;
1145 }
1146
1147 int32_t llama_memory_recurrent_context::get_rs_z() const {
Target 0: (llama-cli) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x3c)
* frame #0: 0x00000001016f7e00 libllama.dylib`llama_memory_recurrent_context::get_head(this=0x0000000000000000) const at llama-memory-recurrent.cpp:1144:12
frame #1: 0x00000001016ad078 libllama.dylib`llm_graph_input_mem_hybrid::can_reuse(this=0x0000600000258500, params=0x000000016fdf5b88) at llama-graph.cpp:481:52
frame #2: 0x00000001016adc60 libllama.dylib`llm_graph_result::can_reuse(this=0x0000000121810600, params=0x000000016fdf5b88) at llama-graph.cpp:565:33
frame #3: 0x000000010164f0f0 libllama.dylib`llama_context::process_ubatch(this=0x0000000102904080, ubatch=0x0000600000750540, gtype=LLM_GRAPH_TYPE_DECODER, mctx=0x00006000039732c0, ret=0x000000016fdf9cd4) at llama-context.cpp:746:38
frame #4: 0x0000000101650b24 libllama.dylib`llama_context::decode(this=0x0000000102904080, batch_inp=0x000000016fdfac88) at llama-context.cpp:1088:28
frame #5: 0x0000000101656a68 libllama.dylib`llama_decode(ctx=0x0000000102904080, batch=llama_batch @ 0x000000016fdfac88) at llama-context.cpp:2747:26
frame #6: 0x0000000100006fb8 llama-cli`main(argc=10, argv=0x000000016fdfd380) at main.cpp:671:21
frame #7: 0x000000019fe72b98 dyld`start + 6076
I'll investigate further, but wanted to post in case it's about to be merged
It looks like it broke in 6589d3b8fcca803e4f2d4ad7da3ff8e87dfaf9ad for me.
Should be ok now. I mistakenly thought that the old mctx of the input would be valid. Let me know if you spot any other issues.
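For anyone following along, a hypothetical illustration of the pitfall (made-up minimal types, not the actual fix): the graph input must not dereference the memory context it captured when the graph was built, since that context can be gone by the next decode; `can_reuse()` has to take the new context from the incoming params and refresh its stored pointer first.

```cpp
#include <cstdint>

// Made-up minimal types, only to illustrate the pitfall described above.
struct mem_ctx {
    uint32_t head = 0;
};

struct graph_params {
    const mem_ctx * mctx = nullptr;   // fresh memory context for this ubatch
};

struct graph_input_rs {
    const mem_ctx * mctx = nullptr;   // captured at build time; may be stale

    bool can_reuse(const graph_params & params) {
        mctx = params.mctx;           // refresh the pointer first ...
        if (!mctx) {
            return false;
        }
        return mctx->head < UINT32_MAX; // ... then inspect only the new context
    }
};
```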
Confirmed, it's working again for me! I'll test a little further with parallel sequences, but I think it's probably ready
Hitting assertions with llama-parallel:
lldb ./bin/llama-parallel -- -m $(find-ollama-gguf.sh granite4:micro-h) -ngl 99 -fa on -ns 10 -np 10
main: clearing the KV cache
Client 0, seq 0, junk = 0, prompt = 267, started decoding ...
Client 1, seq 1, junk = 0, prompt = 267, started decoding ...
Client 2, seq 2, junk = 0, prompt = 267, started decoding ...
Client 3, seq 3, junk = 0, prompt = 270, started decoding ...
Client 4, seq 4, junk = 0, prompt = 273, started decoding ...
Client 5, seq 5, junk = 0, prompt = 267, started decoding ...
Client 6, seq 6, junk = 0, prompt = 273, started decoding ...
Client 7, seq 7, junk = 0, prompt = 273, started decoding ...
Client 8, seq 8, junk = 0, prompt = 273, started decoding ...
Client 9, seq 9, junk = 0, prompt = 270, started decoding ...
/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c:1648: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
(lldb) process attach --pid 23228
error: attach failed: tried to attach to process already being debugged
Process 23228 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
frame #0: 0x00000001a01da388 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
-> 0x1a01da388 <+8>: b.lo 0x1a01da3a8 ; <+40>
0x1a01da38c <+12>: pacibsp
0x1a01da390 <+16>: stp x29, x30, [sp, #-0x10]!
0x1a01da394 <+20>: mov x29, sp
Target 0: (llama-parallel) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
* frame #0: 0x00000001a01da388 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x00000001a021388c libsystem_pthread.dylib`pthread_kill + 296
frame #2: 0x00000001a011ca3c libsystem_c.dylib`abort + 124
frame #3: 0x0000000101211554 libggml-base.dylib`ggml_abort(file="/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c", line=1648, fmt="GGML_ASSERT(%s) failed") at ggml.c:233:5
frame #4: 0x0000000101213adc libggml-base.dylib`ggml_new_tensor_impl(ctx=0x00006000005fa040, type=GGML_TYPE_I32, n_dims=1, ne=0x000000016fdf6218, view_src=0x00000001304e0740, view_offs=0) at ggml.c:1648:5
frame #5: 0x0000000101218e08 libggml-base.dylib`ggml_view_impl(ctx=0x00006000005fa040, a=0x00000001304e0740, n_dims=1, ne=0x000000016fdf6218, offset=0) at ggml.c:3477:35
frame #6: 0x0000000101218dac libggml-base.dylib`ggml_view_1d(ctx=0x00006000005fa040, a=0x00000001304e0740, ne0=8, offset=0) at ggml.c:3495:35
frame #7: 0x00000001016afe0c libllama.dylib`build_rs_inp_impl(ctx0=0x00006000005fa040, ubatch=0x000000016fdfaa08, mctx_cur=0x000060000354b430) at llama-graph.cpp:1839:25
frame #8: 0x00000001016b0398 libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x000000013bf1f1d0) const at llama-graph.cpp:1910:21
frame #9: 0x00000001017ea7dc libllama.dylib`llm_build_granite_hybrid::llm_build_granite_hybrid(this=0x000000013bf1f1d0, model=0x000000013e020e00, params=0x000000016fdf6c28) at llama-model.cpp:16190:22
frame #10: 0x00000001017ea6b8 libllama.dylib`llm_build_granite_hybrid::llm_build_granite_hybrid(this=0x000000013bf1f1d0, model=0x000000013e020e00, params=0x000000016fdf6c28) at llama-model.cpp:16180:41
frame #11: 0x0000000101775774 libllama.dylib`std::__1::__unique_if<llm_build_granite_hybrid>::__unique_single std::__1::make_unique[abi:ne190102]<llm_build_granite_hybrid, llama_model const&, llm_graph_params const&>(__args=0x000000013e020e00, __args=0x000000016fdf6c28) at unique_ptr.h:635:30
frame #12: 0x0000000101770c48 libllama.dylib`llama_model::build_graph(this=0x000000013e020e00, params=0x000000016fdf6c28) const at llama-model.cpp:19824:23
frame #13: 0x000000010164b180 libllama.dylib`llama_context::process_ubatch(this=0x0000000120a04080, ubatch=0x000000013bf1e1a0, gtype=LLM_GRAPH_TYPE_DECODER, mctx=0x0000600000369f40, ret=0x000000016fdfad74) at llama-context.cpp:758:20
frame #14: 0x000000010164cb24 libllama.dylib`llama_context::decode(this=0x0000000120a04080, batch_inp=0x000000016fdfb600) at llama-context.cpp:1088:28
frame #15: 0x0000000101652a68 libllama.dylib`llama_decode(ctx=0x0000000120a04080, batch=llama_batch @ 0x000000016fdfb600) at llama-context.cpp:2747:26
frame #16: 0x000000010000410c llama-parallel`main(argc=11, argv=0x000000016fdfd3e0) at parallel.cpp:402:29
frame #17: 0x000000019fe72b98 dyld`start + 6076
Just confirmed that I don't hit these on master (81086cd6a)
Running cleanly with those reverts
In case it's helpful, I was seeing it consistently on the second call to build_inp_mem_hybrid during the parallel portion of the test
debug logs
llama_kv_cache: size = 352.00 MiB ( 4096 cells, 4 layers, 11/11 seqs), K (f16): 176.00 MiB, V (f16): 176.00 MiB
llama_memory_recurrent: Metal RS buffer size = 811.72 MiB
llama_memory_recurrent: size = 811.72 MiB ( 11 cells, 40 layers, 11 seqs), R (f32): 19.72 MiB, S (f32): 792.00 MiB
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x000000013380a160) const at llama-graph.cpp:1910:44
1907 llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
1908 const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
1909
-> 1910 auto inp_rs = build_rs_inp_impl (ctx0, ubatch, mctx_cur->get_recr());
1911 auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
1912
1913 auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132e09b70) const at llama-graph.cpp:1910:44
1907 llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
1908 const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
1909
-> 1910 auto inp_rs = build_rs_inp_impl (ctx0, ubatch, mctx_cur->get_recr());
1911 auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
1912
1913 auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132e09b70) const at llama-graph.cpp:1910:44
1907 llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
1908 const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
1909
-> 1910 auto inp_rs = build_rs_inp_impl (ctx0, ubatch, mctx_cur->get_recr());
1911 auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
1912
1913 auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
llama_context: Metal compute buffer size = 256.67 MiB
llama_context: CPU compute buffer size = 15.05 MiB
llama_context: graph nodes = 2303
llama_context: graph splits = 3
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000102f04080) const at llama-graph.cpp:1910:44
1907 llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
1908 const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
1909
-> 1910 auto inp_rs = build_rs_inp_impl (ctx0, ubatch, mctx_cur->get_recr());
1911 auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
1912
1913 auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
2025-10-10 10:41:04.730362-0600 llama-parallel[35994:106223077] flock failed to lock list file (/var/folders/20/4th8f1dj2t15_21ygkdhskdc0000gn/C//com.apple.metal/16777235_419/functions.list): errno = 35
No new questions so proceed with build-in defaults.
main: initializing samplers with different RNG seeds, starting from -1
main: Simulating parallel requests from clients:
main: n_parallel = 10, n_sequences = 10, cont_batching = 1, system tokens = 256
Processing requests ...
main: clearing the KV cache
Client 0, seq 0, junk = 0, prompt = 267, started decoding ...
Client 1, seq 1, junk = 0, prompt = 267, started decoding ...
Client 2, seq 2, junk = 0, prompt = 267, started decoding ...
Client 3, seq 3, junk = 0, prompt = 270, started decoding ...
Client 4, seq 4, junk = 0, prompt = 273, started decoding ...
Client 5, seq 5, junk = 0, prompt = 267, started decoding ...
Client 6, seq 6, junk = 0, prompt = 273, started decoding ...
Client 7, seq 7, junk = 0, prompt = 273, started decoding ...
Client 8, seq 8, junk = 0, prompt = 273, started decoding ...
Client 9, seq 9, junk = 0, prompt = 270, started decoding ...
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132f0c4b0) const at llama-graph.cpp:1910:44
1907 llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
1908 const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
1909
-> 1910 auto inp_rs = build_rs_inp_impl (ctx0, ubatch, mctx_cur->get_recr());
1911 auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
1912
1913 auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132f0c4b0) const at llama-graph.cpp:1910:44
1907 llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
1908 const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
1909
-> 1910 auto inp_rs = build_rs_inp_impl (ctx0, ubatch, mctx_cur->get_recr());
1911 auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
1912
1913 auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c:1648: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
(lldb) process attach --pid 35994
error: attach failed: tried to attach to process already being debugged
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x0000000101211550 libggml-base.dylib`ggml_abort(file="/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c", line=1648, fmt="GGML_ASSERT(%s) failed") at ggml.c:233:5
230 ggml_print_backtrace();
231 }
232
-> 233 abort();
234 }
235
236 // ggml_print_backtrace is registered with std::set_terminate by ggml.cpp
Target 0: (llama-parallel) stopped.
So the change in 00f115fe810815d4a22a6dee0acc346131e970e1 does not work for some reason. We want to eventually extract the state of the recurrent memory into the memory context as we do with the KV cache implementations. But I think there is something being mutated when it should not be. For now, let's revert this and figure it out later.
To clarify, the design is that when building the graph we should only reference data that is stored in the memory context (i.e. in llama_memory_recurrent_context), and not in the memory itself (i.e. in llama_memory_recurrent), except for some constant members such as the ggml tensors.
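A hypothetical sketch of that design (stand-in names, not the actual classes): the context snapshots the mutable state it needs, e.g. `head` and `rs_z`, at construction, so graph building never reaches back into the owning `llama_memory_recurrent` object.

```cpp
#include <cstdint>

struct llama_memory_recurrent_sketch;   // stand-in for the owning memory object

// Stand-in for llama_memory_recurrent_context: copies the mutable state it
// needs at construction time, so later mutations of the memory object (or its
// destruction) cannot affect graph building or can_reuse() checks.
struct recurrent_context_sketch {
    uint32_t head;
    int32_t  rs_z;

    recurrent_context_sketch(uint32_t head_, int32_t rs_z_)
        : head(head_), rs_z(rs_z_) {}

    uint32_t get_head() const { return head; }
    int32_t  get_rs_z() const { return rs_z; }

    // Constant members (e.g. the ggml tensors of the cells) could still be
    // referenced from the memory object, per the exception mentioned above.
};
```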
Got it, that makes sense.