graph : reuse SSM graphs
Not sure if there is a reason not to enable graph reuse for recurrent graphs (mamba, hybrids, SSM, etc.). I did a few tests and it seems to work, resulting in some modest perf improvements. cc @gabe-l-hart @compilade
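For reference, a minimal sketch of the reuse path that `LLAMA_GRAPH_REUSE_DISABLE` toggles, assuming a simplified view of `llama_context::process_ubatch`; the `graph_params`/`graph_result` types below are stand-ins, not the actual llama.cpp classes:

```cpp
#include <cstdlib>
#include <memory>

// Stand-in types: the real ones live in llama-graph.h / llama-context.h.
struct graph_params { /* ubatch shape, memory-context state, ... */ };

struct graph_result {
    // true when the previously built graph can serve the new params
    bool can_reuse(const graph_params & /*params*/) const { return true; }
};

static std::unique_ptr<graph_result> build_graph(const graph_params &) {
    return std::make_unique<graph_result>();
}

static std::unique_ptr<graph_result> gf_prev;

graph_result * process_ubatch(const graph_params & params) {
    // presumably read once at startup in the real code
    const bool reuse_disabled = std::getenv("LLAMA_GRAPH_REUSE_DISABLE") != nullptr;

    if (!reuse_disabled && gf_prev && gf_prev->can_reuse(params)) {
        return gf_prev.get();      // reuse: only the input tensors get refreshed
    }

    gf_prev = build_graph(params); // otherwise rebuild the graph from scratch
    return gf_prev.get();
}
```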
Without graph reuse
make -j && LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-bench -m ../models/mamba-130m/ggml-model-f16.gguf -m ../models/granite-4-h-tiny/ggml-model-q8_0.gguf -m ../models/ai21-jamba-mini-1.7/ggml-model-q8_0.gguf -m ../models/liquidai-lfm2-2.6b/ggml-model-q4_k.gguf -fa 1 -t 1 -n 32
| model | size | params | backend | ngl | threads | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | pp512 | 8415.73 ± 46.47 |
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | tg32 | 322.74 ± 0.64 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | pp512 | 2119.36 ± 3.31 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | tg32 | 77.17 ± 0.11 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | pp512 | 603.47 ± 1.83 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | tg32 | 42.35 ± 0.02 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | pp512 | 2923.41 ± 3.20 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | tg32 | 169.83 ± 0.67 |
| build: 638e2c239 (6725) |
With graph reuse
make -j && ./bin/llama-bench -m ../models/mamba-130m/ggml-model-f16.gguf -m ../models/granite-4-h-tiny/ggml-model-q8_0.gguf -m ../models/ai21-jamba-mini-1.7/ggml-model-q8_0.gguf -m ../models/liquidai-lfm2-2.6b/ggml-model-q4_k.gguf -fa 1 -t 1 -n 32
| model | size | params | backend | ngl | threads | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | pp512 | 8453.65 ± 20.10 |
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | tg32 | 348.83 ± 1.67 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | pp512 | 2126.12 ± 1.90 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | tg32 | 82.26 ± 0.13 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | pp512 | 604.56 ± 2.08 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | tg32 | 43.22 ± 0.02 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | pp512 | 2928.31 ± 1.78 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | tg32 | 179.18 ± 0.47 |
| build: 638e2c239 (6725) |
Very cool! I'll test shortly with Granite 4.
The only thought I've had about why this might be difficult is around implementing the SSD version of SSM_SCAN. In mamba_ssm and mlx, they conditionally use SSD if (and only if) the cache is empty and the sequence length is > 1. Since SSD is composed of a bunch of smaller ops (tril and cumsum), one way it could be implemented is at the graph-building layer, which would result in different graphs for different parts of the generate loop. That said, it could also be implemented inside the SSM_SCAN kernel-dispatching layer, so I don't think it's a blocker for reusing graphs.
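As a sketch of that dispatch rule (a hypothetical helper based only on the condition described above, not llama.cpp code):

```cpp
// Hypothetical helper: choose the SSD (chunked, attention-like) formulation
// only when there is no prior recurrent state and more than one token is
// processed per sequence; otherwise fall back to the sequential scan.
// Deciding this at graph-build time yields different graphs for prefill vs.
// decode; deciding it inside the kernel dispatch keeps the graph topology
// fixed, which is friendlier to graph reuse.
static bool use_ssd_path(bool cache_empty, int n_seq_tokens) {
    return cache_empty && n_seq_tokens > 1;
}
```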
@gabe-l-hart What is SSD?
> What is SSD?
Sorry, commenting from my phone at the airport! SSD is the State Space Duality part of the mamba2 paper where they reframe the SSM_SCAN op as an attention operation. The mlx implementation is here and the original triton kernel is here. I'm still working on actually grokking the math and was hoping to try to get it implemented in ggml soon-ish. It should provide a nice performance boost for prefill.
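For reference, a sketch of the duality in the scalar-decay case from the Mamba-2 paper (notation follows the paper, not llama.cpp):

```latex
% Recurrence form of the selective SSM:
\[
h_t = A_t\, h_{t-1} + B_t\, x_t, \qquad y_t = C_t^{\top} h_t .
\]
% Unrolled over a chunk with an empty initial state, the same computation
% becomes a matrix multiply with a lower-triangular, attention-like matrix:
\[
y_j = \sum_{i \le j} C_j^{\top} \Bigl( \prod_{k=i+1}^{j} A_k \Bigr) B_i\, x_i
\quad\Longleftrightarrow\quad
y = M x, \qquad
M_{ji} = C_j^{\top} \Bigl( \prod_{k=i+1}^{j} A_k \Bigr) B_i \quad (j \ge i),
\]
% which is where the tril mask and the cumulative sum of \(\log A_k\)
% (used to evaluate the products stably) come in.
```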
Results looking good for granite4:micro-h (using the GGUF we uploaded to Ollama):
Metal
Reuse on, fa on
./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa on
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 128 | 128 | 1 | 256 | 0.166 | 770.58 | 1.617 | 79.16 | 1.783 | 143.56 |
| 128 | 128 | 2 | 512 | 0.309 | 828.03 | 4.041 | 63.35 | 4.350 | 117.69 |
| 128 | 128 | 4 | 1024 | 0.598 | 856.59 | 7.172 | 71.39 | 7.770 | 131.80 |
| 256 | 128 | 1 | 384 | 0.307 | 834.22 | 1.628 | 78.63 | 1.935 | 198.48 |
| 256 | 128 | 2 | 768 | 0.593 | 863.30 | 4.048 | 63.24 | 4.641 | 165.48 |
| 256 | 128 | 4 | 1536 | 1.191 | 860.14 | 7.162 | 71.48 | 8.353 | 183.89 |
Reuse on, fa off
./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa off
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 128 | 128 | 1 | 256 | 0.167 | 768.56 | 1.697 | 75.43 | 1.863 | 137.38 |
| 128 | 128 | 2 | 512 | 0.310 | 826.56 | 4.130 | 61.99 | 4.440 | 115.32 |
| 128 | 128 | 4 | 1024 | 0.599 | 854.75 | 7.232 | 70.80 | 7.831 | 130.76 |
| 256 | 128 | 1 | 384 | 0.307 | 833.12 | 1.705 | 75.08 | 2.012 | 190.84 |
| 256 | 128 | 2 | 768 | 0.594 | 861.28 | 4.175 | 61.32 | 4.770 | 161.02 |
| 256 | 128 | 4 | 1536 | 1.193 | 858.02 | 7.237 | 70.75 | 8.430 | 182.20 |
Reuse off, fa on
LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa on
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 128 | 128 | 1 | 256 | 0.166 | 769.00 | 1.714 | 74.69 | 1.880 | 136.16 |
| 128 | 128 | 2 | 512 | 0.309 | 828.93 | 4.209 | 60.83 | 4.517 | 113.34 |
| 128 | 128 | 4 | 1024 | 0.598 | 855.49 | 7.282 | 70.31 | 7.881 | 129.94 |
| 256 | 128 | 1 | 384 | 0.307 | 834.57 | 1.763 | 72.61 | 2.070 | 185.55 |
| 256 | 128 | 2 | 768 | 0.593 | 864.13 | 4.176 | 61.30 | 4.769 | 161.04 |
| 256 | 128 | 4 | 1536 | 1.190 | 860.30 | 7.291 | 70.22 | 8.481 | 181.10 |
Reuse off, fa off
LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa off
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 128 | 128 | 1 | 256 | 0.170 | 751.10 | 1.790 | 71.50 | 1.961 | 130.58 |
| 128 | 128 | 2 | 512 | 0.310 | 826.98 | 4.190 | 61.10 | 4.499 | 113.79 |
| 128 | 128 | 4 | 1024 | 0.602 | 850.25 | 7.299 | 70.15 | 7.901 | 129.60 |
| 256 | 128 | 1 | 384 | 0.309 | 829.05 | 1.793 | 71.38 | 2.102 | 182.67 |
| 256 | 128 | 2 | 768 | 0.596 | 858.70 | 4.220 | 60.67 | 4.816 | 159.46 |
| 256 | 128 | 4 | 1536 | 1.193 | 858.16 | 7.309 | 70.05 | 8.502 | 180.66 |
@gabe-l-hart as a side note, I've added cumsum and tri as new ops during the Qwen3Next implementation, so that might allow for some decoupling.
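For illustration, a generic sketch (plain C++ on `std::vector`, not the new ggml ops themselves) of how a cumulative sum plus a lower-triangular mask compose into the "segsum" term used by SSD, i.e. the sum of log A_k over k = i+1..j for j >= i:

```cpp
#include <limits>
#include <vector>

// Generic illustration only: build the segment-sum matrix
// segsum[j][i] = log A_{i+1} + ... + log A_j for j >= i (and -inf below the
// mask) from a cumulative sum and a tril mask.
std::vector<std::vector<float>> segsum(const std::vector<float> & log_a) {
    const size_t n = log_a.size();

    std::vector<float> cs(n + 1, 0.0f);                  // exclusive cumulative sum
    for (size_t t = 0; t < n; ++t) {
        cs[t + 1] = cs[t] + log_a[t];
    }

    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<std::vector<float>> out(n, std::vector<float>(n, neg_inf));
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i <= j; ++i) {                // tril: keep j >= i only
            out[j][i] = cs[j + 1] - cs[i + 1];           // sum of log_a[i+1..j]
        }
    }
    return out;
}
```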
> I've added cumsum and tri as new ops during the Qwen3Next implementation, so that might allow for some decoupling.
@pwilkin I thought I saw that while trying to keep up with the comments! It's high on my todo list after this conference to get into your PR (partly selfishly, because I want to reuse these parts).
@gabe-l-hart Parallel performance of SSMs should be fixed with #16494
Thank you for digging into these performance improvements!
I'm hitting errors on metal with the most recent changes on this branch:
lldb ./bin/llama-cli -- -m $(find-ollama-gguf.sh granite4:micro-h) -no-cnv -p "tell me a story about a developer and their dog?" -ngl 99 --temp 0
tell me a story about a developer and their dog? The response must
Process 95451 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x3c)
frame #0: 0x00000001016f7e00 libllama.dylib`llama_memory_recurrent_context::get_head(this=0x0000000000000000) const at llama-memory-recurrent.cpp:1144:12
1141 }
1142
1143 uint32_t llama_memory_recurrent_context::get_head() const {
-> 1144 return head;
1145 }
1146
1147 int32_t llama_memory_recurrent_context::get_rs_z() const {
Target 0: (llama-cli) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x3c)
* frame #0: 0x00000001016f7e00 libllama.dylib`llama_memory_recurrent_context::get_head(this=0x0000000000000000) const at llama-memory-recurrent.cpp:1144:12
frame #1: 0x00000001016ad078 libllama.dylib`llm_graph_input_mem_hybrid::can_reuse(this=0x0000600000258500, params=0x000000016fdf5b88) at llama-graph.cpp:481:52
frame #2: 0x00000001016adc60 libllama.dylib`llm_graph_result::can_reuse(this=0x0000000121810600, params=0x000000016fdf5b88) at llama-graph.cpp:565:33
frame #3: 0x000000010164f0f0 libllama.dylib`llama_context::process_ubatch(this=0x0000000102904080, ubatch=0x0000600000750540, gtype=LLM_GRAPH_TYPE_DECODER, mctx=0x00006000039732c0, ret=0x000000016fdf9cd4) at llama-context.cpp:746:38
frame #4: 0x0000000101650b24 libllama.dylib`llama_context::decode(this=0x0000000102904080, batch_inp=0x000000016fdfac88) at llama-context.cpp:1088:28
frame #5: 0x0000000101656a68 libllama.dylib`llama_decode(ctx=0x0000000102904080, batch=llama_batch @ 0x000000016fdfac88) at llama-context.cpp:2747:26
frame #6: 0x0000000100006fb8 llama-cli`main(argc=10, argv=0x000000016fdfd380) at main.cpp:671:21
frame #7: 0x000000019fe72b98 dyld`start + 6076
I'll investigate further, but wanted to post in case it's about to be merged
It looks like it broke in 6589d3b8fcca803e4f2d4ad7da3ff8e87dfaf9ad for me.
Should be ok now. I mistakenly thought that the old mctx of the input would be valid. Let me know if you spot any other issues.
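For anyone following along, a hypothetical illustration of the pitfall (made-up minimal types, not the actual fix): the graph input must not dereference the memory context it captured when the graph was built, since that context can be gone by the next decode; `can_reuse()` has to take the new context from the incoming params and refresh its stored pointer first.

```cpp
#include <cstdint>

// Made-up minimal types, only to illustrate the pitfall described above.
struct mem_ctx {
    uint32_t head = 0;
};

struct graph_params {
    const mem_ctx * mctx = nullptr;   // fresh memory context for this ubatch
};

struct graph_input_rs {
    const mem_ctx * mctx = nullptr;   // captured at build time; may be stale

    bool can_reuse(const graph_params & params) {
        mctx = params.mctx;           // refresh the pointer first ...
        if (!mctx) {
            return false;
        }
        return mctx->head < UINT32_MAX; // ... then inspect only the new context
    }
};
```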
Confirmed, it's working again for me! I'll test a little further with parallel sequences, but I think it's probably ready
Hitting assertions with llama-parallel:
lldb ./bin/llama-parallel -- -m $(find-ollama-gguf.sh granite4:micro-h) -ngl 99 -fa on -ns 10 -np 10
main: clearing the KV cache
Client 0, seq 0, junk = 0, prompt = 267, started decoding ...
Client 1, seq 1, junk = 0, prompt = 267, started decoding ...
Client 2, seq 2, junk = 0, prompt = 267, started decoding ...
Client 3, seq 3, junk = 0, prompt = 270, started decoding ...
Client 4, seq 4, junk = 0, prompt = 273, started decoding ...
Client 5, seq 5, junk = 0, prompt = 267, started decoding ...
Client 6, seq 6, junk = 0, prompt = 273, started decoding ...
Client 7, seq 7, junk = 0, prompt = 273, started decoding ...
Client 8, seq 8, junk = 0, prompt = 273, started decoding ...
Client 9, seq 9, junk = 0, prompt = 270, started decoding ...
/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c:1648: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
(lldb) process attach --pid 23228
error: attach failed: tried to attach to process already being debugged
Process 23228 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
frame #0: 0x00000001a01da388 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
-> 0x1a01da388 <+8>: b.lo 0x1a01da3a8 ; <+40>
0x1a01da38c <+12>: pacibsp
0x1a01da390 <+16>: stp x29, x30, [sp, #-0x10]!
0x1a01da394 <+20>: mov x29, sp
Target 0: (llama-parallel) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
* frame #0: 0x00000001a01da388 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x00000001a021388c libsystem_pthread.dylib`pthread_kill + 296
frame #2: 0x00000001a011ca3c libsystem_c.dylib`abort + 124
frame #3: 0x0000000101211554 libggml-base.dylib`ggml_abort(file="/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c", line=1648, fmt="GGML_ASSERT(%s) failed") at ggml.c:233:5
frame #4: 0x0000000101213adc libggml-base.dylib`ggml_new_tensor_impl(ctx=0x00006000005fa040, type=GGML_TYPE_I32, n_dims=1, ne=0x000000016fdf6218, view_src=0x00000001304e0740, view_offs=0) at ggml.c:1648:5
frame #5: 0x0000000101218e08 libggml-base.dylib`ggml_view_impl(ctx=0x00006000005fa040, a=0x00000001304e0740, n_dims=1, ne=0x000000016fdf6218, offset=0) at ggml.c:3477:35
frame #6: 0x0000000101218dac libggml-base.dylib`ggml_view_1d(ctx=0x00006000005fa040, a=0x00000001304e0740, ne0=8, offset=0) at ggml.c:3495:35
frame #7: 0x00000001016afe0c libllama.dylib`build_rs_inp_impl(ctx0=0x00006000005fa040, ubatch=0x000000016fdfaa08, mctx_cur=0x000060000354b430) at llama-graph.cpp:1839:25
frame #8: 0x00000001016b0398 libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x000000013bf1f1d0) const at llama-graph.cpp:1910:21
frame #9: 0x00000001017ea7dc libllama.dylib`llm_build_granite_hybrid::llm_build_granite_hybrid(this=0x000000013bf1f1d0, model=0x000000013e020e00, params=0x000000016fdf6c28) at llama-model.cpp:16190:22
frame #10: 0x00000001017ea6b8 libllama.dylib`llm_build_granite_hybrid::llm_build_granite_hybrid(this=0x000000013bf1f1d0, model=0x000000013e020e00, params=0x000000016fdf6c28) at llama-model.cpp:16180:41
frame #11: 0x0000000101775774 libllama.dylib`std::__1::__unique_if<llm_build_granite_hybrid>::__unique_single std::__1::make_unique[abi:ne190102]<llm_build_granite_hybrid, llama_model const&, llm_graph_params const&>(__args=0x000000013e020e00, __args=0x000000016fdf6c28) at unique_ptr.h:635:30
frame #12: 0x0000000101770c48 libllama.dylib`llama_model::build_graph(this=0x000000013e020e00, params=0x000000016fdf6c28) const at llama-model.cpp:19824:23
frame #13: 0x000000010164b180 libllama.dylib`llama_context::process_ubatch(this=0x0000000120a04080, ubatch=0x000000013bf1e1a0, gtype=LLM_GRAPH_TYPE_DECODER, mctx=0x0000600000369f40, ret=0x000000016fdfad74) at llama-context.cpp:758:20
frame #14: 0x000000010164cb24 libllama.dylib`llama_context::decode(this=0x0000000120a04080, batch_inp=0x000000016fdfb600) at llama-context.cpp:1088:28
frame #15: 0x0000000101652a68 libllama.dylib`llama_decode(ctx=0x0000000120a04080, batch=llama_batch @ 0x000000016fdfb600) at llama-context.cpp:2747:26
frame #16: 0x000000010000410c llama-parallel`main(argc=11, argv=0x000000016fdfd3e0) at parallel.cpp:402:29
frame #17: 0x000000019fe72b98 dyld`start + 6076
Just confirmed that I don't hit these on master (81086cd6a)
Running cleanly with those reverts
In case it's helpful, I was seeing it consistently on the second call to build_inp_mem_hybrid during the parallel portion of the test
debug logs
llama_kv_cache: size = 352.00 MiB ( 4096 cells, 4 layers, 11/11 seqs), K (f16): 176.00 MiB, V (f16): 176.00 MiB
llama_memory_recurrent: Metal RS buffer size = 811.72 MiB
llama_memory_recurrent: size = 811.72 MiB ( 11 cells, 40 layers, 11 seqs), R (f32): 19.72 MiB, S (f32): 792.00 MiB
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x000000013380a160) const at llama-graph.cpp:1910:44
1907 llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
1908 const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
1909
-> 1910 auto inp_rs = build_rs_inp_impl (ctx0, ubatch, mctx_cur->get_recr());
1911 auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
1912
1913 auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132e09b70) const at llama-graph.cpp:1910:44
1907 llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
1908 const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
1909
-> 1910 auto inp_rs = build_rs_inp_impl (ctx0, ubatch, mctx_cur->get_recr());
1911 auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
1912
1913 auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132e09b70) const at llama-graph.cpp:1910:44
1907 llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
1908 const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
1909
-> 1910 auto inp_rs = build_rs_inp_impl (ctx0, ubatch, mctx_cur->get_recr());
1911 auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
1912
1913 auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
llama_context: Metal compute buffer size = 256.67 MiB
llama_context: CPU compute buffer size = 15.05 MiB
llama_context: graph nodes = 2303
llama_context: graph splits = 3
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000102f04080) const at llama-graph.cpp:1910:44
1907 llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
1908 const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
1909
-> 1910 auto inp_rs = build_rs_inp_impl (ctx0, ubatch, mctx_cur->get_recr());
1911 auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
1912
1913 auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
2025-10-10 10:41:04.730362-0600 llama-parallel[35994:106223077] flock failed to lock list file (/var/folders/20/4th8f1dj2t15_21ygkdhskdc0000gn/C//com.apple.metal/16777235_419/functions.list): errno = 35
No new questions so proceed with build-in defaults.
main: initializing samplers with different RNG seeds, starting from -1
main: Simulating parallel requests from clients:
main: n_parallel = 10, n_sequences = 10, cont_batching = 1, system tokens = 256
Processing requests ...
main: clearing the KV cache
Client 0, seq 0, junk = 0, prompt = 267, started decoding ...
Client 1, seq 1, junk = 0, prompt = 267, started decoding ...
Client 2, seq 2, junk = 0, prompt = 267, started decoding ...
Client 3, seq 3, junk = 0, prompt = 270, started decoding ...
Client 4, seq 4, junk = 0, prompt = 273, started decoding ...
Client 5, seq 5, junk = 0, prompt = 267, started decoding ...
Client 6, seq 6, junk = 0, prompt = 273, started decoding ...
Client 7, seq 7, junk = 0, prompt = 273, started decoding ...
Client 8, seq 8, junk = 0, prompt = 273, started decoding ...
Client 9, seq 9, junk = 0, prompt = 270, started decoding ...
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132f0c4b0) const at llama-graph.cpp:1910:44
1907 llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
1908 const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
1909
-> 1910 auto inp_rs = build_rs_inp_impl (ctx0, ubatch, mctx_cur->get_recr());
1911 auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
1912
1913 auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132f0c4b0) const at llama-graph.cpp:1910:44
1907 llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
1908 const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
1909
-> 1910 auto inp_rs = build_rs_inp_impl (ctx0, ubatch, mctx_cur->get_recr());
1911 auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
1912
1913 auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c:1648: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
(lldb) process attach --pid 35994
error: attach failed: tried to attach to process already being debugged
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x0000000101211550 libggml-base.dylib`ggml_abort(file="/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c", line=1648, fmt="GGML_ASSERT(%s) failed") at ggml.c:233:5
230 ggml_print_backtrace();
231 }
232
-> 233 abort();
234 }
235
236 // ggml_print_backtrace is registered with std::set_terminate by ggml.cpp
Target 0: (llama-parallel) stopped.
So the change in 00f115fe810815d4a22a6dee0acc346131e970e1 does not work for some reason. We want to eventually extract the state of the recurrent memory into the memory context as we do with the KV cache implementations. But I think there is something being mutated when it should not be. For now, let's revert this and figure it out later.
To clarify, the design is that when building the graph we should only reference data that is stored in the memory context (i.e. in llama_memory_recurrent_context), and not in the memory itself (i.e. in llama_memory_recurrent), except for some constant members such as the ggml tensors.
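A hypothetical sketch of that design (stand-in names, not the actual classes): the context snapshots the mutable state it needs, e.g. `head` and `rs_z`, at construction, so graph building never reaches back into the owning `llama_memory_recurrent` object.

```cpp
#include <cstdint>

struct llama_memory_recurrent_sketch;   // stand-in for the owning memory object

// Stand-in for llama_memory_recurrent_context: copies the mutable state it
// needs at construction time, so later mutations of the memory object (or its
// destruction) cannot affect graph building or can_reuse() checks.
struct recurrent_context_sketch {
    uint32_t head;
    int32_t  rs_z;

    recurrent_context_sketch(uint32_t head_, int32_t rs_z_)
        : head(head_), rs_z(rs_z_) {}

    uint32_t get_head() const { return head; }
    int32_t  get_rs_z() const { return rs_z; }

    // Constant members (e.g. the ggml tensors of the cells) could still be
    // referenced from the memory object, per the exception mentioned above.
};
```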
Got it, that makes sense.