
graph : reuse SSM graphs

Open • ggerganov opened this issue 2 months ago • 18 comments

Not sure if there is a reason not to enable graph reuse for recurrent graphs (mamba, hybrids, SSM, etc.). I did a few tests and it seems to work, resulting in some modest perf improvements. cc @gabe-l-hart @compilade

Without graph reuse

```
make -j && LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-bench -m ../models/mamba-130m/ggml-model-f16.gguf -m ../models/granite-4-h-tiny/ggml-model-q8_0.gguf -m ../models/ai21-jamba-mini-1.7/ggml-model-q8_0.gguf -m ../models/liquidai-lfm2-2.6b/ggml-model-q4_k.gguf -fa 1 -t 1 -n 32
```

| model | size | params | backend | ngl | threads | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | --- | ---: |
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | pp512 | 8415.73 ± 46.47 |
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | tg32 | 322.74 ± 0.64 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | pp512 | 2119.36 ± 3.31 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | tg32 | 77.17 ± 0.11 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | pp512 | 603.47 ± 1.83 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | tg32 | 42.35 ± 0.02 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | pp512 | 2923.41 ± 3.20 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | tg32 | 169.83 ± 0.67 |

build: 638e2c239 (6725)

With graph reuse

```
make -j && ./bin/llama-bench -m ../models/mamba-130m/ggml-model-f16.gguf -m ../models/granite-4-h-tiny/ggml-model-q8_0.gguf -m ../models/ai21-jamba-mini-1.7/ggml-model-q8_0.gguf -m ../models/liquidai-lfm2-2.6b/ggml-model-q4_k.gguf -fa 1 -t 1 -n 32
```

| model | size | params | backend | ngl | threads | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | --- | ---: |
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | pp512 | 8453.65 ± 20.10 |
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | tg32 | 348.83 ± 1.67 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | pp512 | 2126.12 ± 1.90 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | tg32 | 82.26 ± 0.13 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | pp512 | 604.56 ± 2.08 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | tg32 | 43.22 ± 0.02 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | pp512 | 2928.31 ± 1.78 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | tg32 | 179.18 ± 0.47 |

build: 638e2c239 (6725)

ggerganov avatar Oct 09 '25 16:10 ggerganov

Very cool! I'll test shortly with Granite 4.

The only thought I've had about why this might be difficult is around implementing the SSD version of SSM_SCAN. In mamba_ssm and mlx, they conditionally use SSD if (and only if) the cache is empty and the sequence length is > 1. Since SSD is composed of a bunch of other smaller ops (tril and cumsum), one way this could be implemented is at the graph-building layer, which would result in different graphs for different parts of the generate loop. That said, it could also be implemented inside the SSM_SCAN kernel-dispatching layer, so I don't think it's a blocker for reusing graphs.
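
As a rough illustration of the dispatch-level option, here is a minimal, self-contained C++ sketch; `ssm_scan_sequential`, `ssm_scan_ssd`, and the `cache_is_empty` flag are hypothetical names for illustration, not actual ggml/llama.cpp symbols:

```cpp
#include <cstdio>
#include <vector>

// Illustrative stand-ins for the two kernel paths; not real ggml/llama.cpp symbols.
static void ssm_scan_sequential(const std::vector<float> & x) {
    std::printf("sequential scan over %zu tokens\n", x.size());
}

static void ssm_scan_ssd(const std::vector<float> & x) {
    std::printf("SSD (chunked, attention-like) scan over %zu tokens\n", x.size());
}

// Hypothetical dispatch layer: the graph topology stays identical, only the
// kernel choice depends on runtime state (empty cache + multi-token prefill -> SSD).
static void ssm_scan_dispatch(const std::vector<float> & x, bool cache_is_empty) {
    if (cache_is_empty && x.size() > 1) {
        ssm_scan_ssd(x);        // prefill: no prior state, many tokens at once
    } else {
        ssm_scan_sequential(x); // decode or continuation: carry the recurrent state
    }
}

int main() {
    ssm_scan_dispatch(std::vector<float>(512, 0.0f), /*cache_is_empty=*/true);  // prefill
    ssm_scan_dispatch(std::vector<float>(1,   0.0f), /*cache_is_empty=*/false); // decode
    return 0;
}
```

The point of this option is that the condition lives in the dispatcher rather than in the graph shape, so a reused graph never needs to change.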

gabe-l-hart avatar Oct 09 '25 17:10 gabe-l-hart

@gabe-l-hart What is SSD?

ggerganov avatar Oct 09 '25 17:10 ggerganov

> What is SSD?

Sorry, commenting from my phone at the airport! SSD is the State Space Duality part of the mamba2 paper, where they reframe the SSM_SCAN op as an attention operation. The mlx implementation is here and the original Triton kernel is here. I'm still working on actually grokking the math and was hoping to try to get it implemented in ggml soon-ish. It should provide a nice performance boost for prefill.
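
For reference, a rough scalar sketch of the duality (one channel, illustrative symbols only; this does not map one-to-one onto the ggml op arguments). The recurrence

$$
h_t = a_t\, h_{t-1} + b_t\, x_t, \qquad y_t = c_t\, h_t
$$

unrolls into an attention-like form

$$
y_t = \sum_{s \le t} c_t \left( \prod_{r=s+1}^{t} a_r \right) b_s\, x_s
    = \sum_{s \le t} M_{ts}\, x_s,
\qquad
M_{ts} = c_t\, b_s \exp\left( \sum_{r \le t} \log a_r - \sum_{r \le s} \log a_r \right)
$$

so the whole prefill is $Y = MX$ with $M$ lower-triangular (the tril mask) and the exponent built from a single cumulative sum of $\log a_r$ (the cumsum), which is what makes it look like masked attention.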

gabe-l-hart avatar Oct 09 '25 17:10 gabe-l-hart

Results looking good for granite4:micro-h (using the GGUF we uploaded to Ollama):


Metal

Reuse on, fa on

```
./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa on
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | --: | --: | --: | --: | --: | --: | --: |
| 128 | 128 | 1 | 256 | 0.166 | 770.58 | 1.617 | 79.16 | 1.783 | 143.56 |
| 128 | 128 | 2 | 512 | 0.309 | 828.03 | 4.041 | 63.35 | 4.350 | 117.69 |
| 128 | 128 | 4 | 1024 | 0.598 | 856.59 | 7.172 | 71.39 | 7.770 | 131.80 |
| 256 | 128 | 1 | 384 | 0.307 | 834.22 | 1.628 | 78.63 | 1.935 | 198.48 |
| 256 | 128 | 2 | 768 | 0.593 | 863.30 | 4.048 | 63.24 | 4.641 | 165.48 |
| 256 | 128 | 4 | 1536 | 1.191 | 860.14 | 7.162 | 71.48 | 8.353 | 183.89 |

Reuse on, fa off

```
./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa off
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | --: | --: | --: | --: | --: | --: | --: |
| 128 | 128 | 1 | 256 | 0.167 | 768.56 | 1.697 | 75.43 | 1.863 | 137.38 |
| 128 | 128 | 2 | 512 | 0.310 | 826.56 | 4.130 | 61.99 | 4.440 | 115.32 |
| 128 | 128 | 4 | 1024 | 0.599 | 854.75 | 7.232 | 70.80 | 7.831 | 130.76 |
| 256 | 128 | 1 | 384 | 0.307 | 833.12 | 1.705 | 75.08 | 2.012 | 190.84 |
| 256 | 128 | 2 | 768 | 0.594 | 861.28 | 4.175 | 61.32 | 4.770 | 161.02 |
| 256 | 128 | 4 | 1536 | 1.193 | 858.02 | 7.237 | 70.75 | 8.430 | 182.20 |

Reuse off, fa on

```
LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa on
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | --: | --: | --: | --: | --: | --: | --: |
| 128 | 128 | 1 | 256 | 0.166 | 769.00 | 1.714 | 74.69 | 1.880 | 136.16 |
| 128 | 128 | 2 | 512 | 0.309 | 828.93 | 4.209 | 60.83 | 4.517 | 113.34 |
| 128 | 128 | 4 | 1024 | 0.598 | 855.49 | 7.282 | 70.31 | 7.881 | 129.94 |
| 256 | 128 | 1 | 384 | 0.307 | 834.57 | 1.763 | 72.61 | 2.070 | 185.55 |
| 256 | 128 | 2 | 768 | 0.593 | 864.13 | 4.176 | 61.30 | 4.769 | 161.04 |
| 256 | 128 | 4 | 1536 | 1.190 | 860.30 | 7.291 | 70.22 | 8.481 | 181.10 |

Reuse off, fa off

```
LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa off
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | --: | --: | --: | --: | --: | --: | --: |
| 128 | 128 | 1 | 256 | 0.170 | 751.10 | 1.790 | 71.50 | 1.961 | 130.58 |
| 128 | 128 | 2 | 512 | 0.310 | 826.98 | 4.190 | 61.10 | 4.499 | 113.79 |
| 128 | 128 | 4 | 1024 | 0.602 | 850.25 | 7.299 | 70.15 | 7.901 | 129.60 |
| 256 | 128 | 1 | 384 | 0.309 | 829.05 | 1.793 | 71.38 | 2.102 | 182.67 |
| 256 | 128 | 2 | 768 | 0.596 | 858.70 | 4.220 | 60.67 | 4.816 | 159.46 |
| 256 | 128 | 4 | 1536 | 1.193 | 858.16 | 7.309 | 70.05 | 8.502 | 180.66 |

gabe-l-hart avatar Oct 09 '25 17:10 gabe-l-hart

@gabe-l-hart as a side note, I've added cumsum and tri as new ops during the Qwen3Next implementation, so that might allow for some decoupling.

pwilkin avatar Oct 09 '25 18:10 pwilkin

> I've added cumsum and tri as new ops during the Qwen3Next implementation, so that might allow for some decoupling.

@pwilkin I thought I saw that while trying to keep up with the comments! It's high on my todo list after this conference to get into your PR (partly selfishly because I want to reuse these parts).
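
As a small illustration of how those two primitives compose, here is a scalar C++ sketch of the "segment sum" that the SSD formulation builds on (plain C++ for illustration only; the actual ggml op names, arguments, and shapes may differ):

```cpp
#include <cstdio>
#include <vector>

// Scalar reference for the SSD segment sum: L[t][s] = a[s+1] + ... + a[t] for s <= t,
// built from a cumulative sum (cumsum) and a lower-triangular/causal mask (tri).
// Here the masked entries are simply zeroed; this is a simplification for illustration.
int main() {
    const std::vector<float> a = {0.1f, 0.2f, 0.3f, 0.4f};
    const size_t T = a.size();

    // cumsum: c[t] = a[0] + ... + a[t]
    std::vector<float> c(T);
    float acc = 0.0f;
    for (size_t t = 0; t < T; ++t) {
        acc += a[t];
        c[t] = acc;
    }

    // tri mask: keep only s <= t, which makes the resulting matrix causal
    for (size_t t = 0; t < T; ++t) {
        for (size_t s = 0; s < T; ++s) {
            const float v = (s <= t) ? c[t] - c[s] : 0.0f;
            std::printf("%6.2f ", v);
        }
        std::printf("\n");
    }
    return 0;
}
```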

gabe-l-hart avatar Oct 09 '25 18:10 gabe-l-hart

@gabe-l-hart Parallel performance of SSMs should be fixed with #16494

ggerganov avatar Oct 10 '25 07:10 ggerganov

Thank you for digging into these performance improvements!

gabe-l-hart avatar Oct 10 '25 11:10 gabe-l-hart

I'm hitting errors on metal with the most recent changes on this branch:

```
lldb ./bin/llama-cli -- -m $(find-ollama-gguf.sh granite4:micro-h) -no-cnv -p "tell me a story about a developer and their dog?" -ngl 99 --temp 0
tell me a story about a developer and their dog? The response must
Process 95451 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x3c)
    frame #0: 0x00000001016f7e00 libllama.dylib`llama_memory_recurrent_context::get_head(this=0x0000000000000000) const at llama-memory-recurrent.cpp:1144:12
   1141	}
   1142	
   1143	uint32_t llama_memory_recurrent_context::get_head() const {
-> 1144	    return head;
   1145	}
   1146	
   1147	int32_t llama_memory_recurrent_context::get_rs_z() const {
Target 0: (llama-cli) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x3c)
  * frame #0: 0x00000001016f7e00 libllama.dylib`llama_memory_recurrent_context::get_head(this=0x0000000000000000) const at llama-memory-recurrent.cpp:1144:12
    frame #1: 0x00000001016ad078 libllama.dylib`llm_graph_input_mem_hybrid::can_reuse(this=0x0000600000258500, params=0x000000016fdf5b88) at llama-graph.cpp:481:52
    frame #2: 0x00000001016adc60 libllama.dylib`llm_graph_result::can_reuse(this=0x0000000121810600, params=0x000000016fdf5b88) at llama-graph.cpp:565:33
    frame #3: 0x000000010164f0f0 libllama.dylib`llama_context::process_ubatch(this=0x0000000102904080, ubatch=0x0000600000750540, gtype=LLM_GRAPH_TYPE_DECODER, mctx=0x00006000039732c0, ret=0x000000016fdf9cd4) at llama-context.cpp:746:38
    frame #4: 0x0000000101650b24 libllama.dylib`llama_context::decode(this=0x0000000102904080, batch_inp=0x000000016fdfac88) at llama-context.cpp:1088:28
    frame #5: 0x0000000101656a68 libllama.dylib`llama_decode(ctx=0x0000000102904080, batch=llama_batch @ 0x000000016fdfac88) at llama-context.cpp:2747:26
    frame #6: 0x0000000100006fb8 llama-cli`main(argc=10, argv=0x000000016fdfd380) at main.cpp:671:21
    frame #7: 0x000000019fe72b98 dyld`start + 6076
```

I'll investigate further, but wanted to post in case it's about to be merged

gabe-l-hart avatar Oct 10 '25 15:10 gabe-l-hart

It looks like it broke in 6589d3b8fcca803e4f2d4ad7da3ff8e87dfaf9ad for me.

gabe-l-hart avatar Oct 10 '25 15:10 gabe-l-hart

Should be ok now. I mistakenly thought that the old mctx of the input would be valid. Let me know if you spot any other issues.

ggerganov avatar Oct 10 '25 16:10 ggerganov

Confirmed, it's working again for me! I'll test a little further with parallel sequences, but I think it's probably ready.

gabe-l-hart avatar Oct 10 '25 16:10 gabe-l-hart

Hitting assertions with llama-parallel:

```
lldb ./bin/llama-parallel -- -m $(find-ollama-gguf.sh granite4:micro-h) -ngl 99 -fa on -ns 10 -np 10
main: clearing the KV cache
Client   0, seq    0, junk =    0, prompt = 267, started decoding ...
Client   1, seq    1, junk =    0, prompt = 267, started decoding ...
Client   2, seq    2, junk =    0, prompt = 267, started decoding ...
Client   3, seq    3, junk =    0, prompt = 270, started decoding ...
Client   4, seq    4, junk =    0, prompt = 273, started decoding ...
Client   5, seq    5, junk =    0, prompt = 267, started decoding ...
Client   6, seq    6, junk =    0, prompt = 273, started decoding ...
Client   7, seq    7, junk =    0, prompt = 273, started decoding ...
Client   8, seq    8, junk =    0, prompt = 273, started decoding ...
Client   9, seq    9, junk =    0, prompt = 270, started decoding ...
/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c:1648: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
(lldb) process attach --pid 23228
error: attach failed: tried to attach to process already being debugged
Process 23228 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00000001a01da388 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
->  0x1a01da388 <+8>:  b.lo   0x1a01da3a8    ; <+40>
    0x1a01da38c <+12>: pacibsp 
    0x1a01da390 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x1a01da394 <+20>: mov    x29, sp
Target 0: (llama-parallel) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00000001a01da388 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001a021388c libsystem_pthread.dylib`pthread_kill + 296
    frame #2: 0x00000001a011ca3c libsystem_c.dylib`abort + 124
    frame #3: 0x0000000101211554 libggml-base.dylib`ggml_abort(file="/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c", line=1648, fmt="GGML_ASSERT(%s) failed") at ggml.c:233:5
    frame #4: 0x0000000101213adc libggml-base.dylib`ggml_new_tensor_impl(ctx=0x00006000005fa040, type=GGML_TYPE_I32, n_dims=1, ne=0x000000016fdf6218, view_src=0x00000001304e0740, view_offs=0) at ggml.c:1648:5
    frame #5: 0x0000000101218e08 libggml-base.dylib`ggml_view_impl(ctx=0x00006000005fa040, a=0x00000001304e0740, n_dims=1, ne=0x000000016fdf6218, offset=0) at ggml.c:3477:35
    frame #6: 0x0000000101218dac libggml-base.dylib`ggml_view_1d(ctx=0x00006000005fa040, a=0x00000001304e0740, ne0=8, offset=0) at ggml.c:3495:35
    frame #7: 0x00000001016afe0c libllama.dylib`build_rs_inp_impl(ctx0=0x00006000005fa040, ubatch=0x000000016fdfaa08, mctx_cur=0x000060000354b430) at llama-graph.cpp:1839:25
    frame #8: 0x00000001016b0398 libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x000000013bf1f1d0) const at llama-graph.cpp:1910:21
    frame #9: 0x00000001017ea7dc libllama.dylib`llm_build_granite_hybrid::llm_build_granite_hybrid(this=0x000000013bf1f1d0, model=0x000000013e020e00, params=0x000000016fdf6c28) at llama-model.cpp:16190:22
    frame #10: 0x00000001017ea6b8 libllama.dylib`llm_build_granite_hybrid::llm_build_granite_hybrid(this=0x000000013bf1f1d0, model=0x000000013e020e00, params=0x000000016fdf6c28) at llama-model.cpp:16180:41
    frame #11: 0x0000000101775774 libllama.dylib`std::__1::__unique_if<llm_build_granite_hybrid>::__unique_single std::__1::make_unique[abi:ne190102]<llm_build_granite_hybrid, llama_model const&, llm_graph_params const&>(__args=0x000000013e020e00, __args=0x000000016fdf6c28) at unique_ptr.h:635:30
    frame #12: 0x0000000101770c48 libllama.dylib`llama_model::build_graph(this=0x000000013e020e00, params=0x000000016fdf6c28) const at llama-model.cpp:19824:23
    frame #13: 0x000000010164b180 libllama.dylib`llama_context::process_ubatch(this=0x0000000120a04080, ubatch=0x000000013bf1e1a0, gtype=LLM_GRAPH_TYPE_DECODER, mctx=0x0000600000369f40, ret=0x000000016fdfad74) at llama-context.cpp:758:20
    frame #14: 0x000000010164cb24 libllama.dylib`llama_context::decode(this=0x0000000120a04080, batch_inp=0x000000016fdfb600) at llama-context.cpp:1088:28
    frame #15: 0x0000000101652a68 libllama.dylib`llama_decode(ctx=0x0000000120a04080, batch=llama_batch @ 0x000000016fdfb600) at llama-context.cpp:2747:26
    frame #16: 0x000000010000410c llama-parallel`main(argc=11, argv=0x000000016fdfd3e0) at parallel.cpp:402:29
    frame #17: 0x000000019fe72b98 dyld`start + 6076
```

gabe-l-hart avatar Oct 10 '25 16:10 gabe-l-hart

Just confirmed that I don't hit these on master (81086cd6a)

gabe-l-hart avatar Oct 10 '25 16:10 gabe-l-hart

Running cleanly with those reverts

gabe-l-hart avatar Oct 10 '25 16:10 gabe-l-hart

In case it's helpful, I was seeing it consistently on the second call to build_inp_mem_hybrid during the parallel portion of the test

debug logs
```
llama_kv_cache: size =  352.00 MiB (  4096 cells,   4 layers, 11/11 seqs), K (f16):  176.00 MiB, V (f16):  176.00 MiB
llama_memory_recurrent:      Metal RS buffer size =   811.72 MiB
llama_memory_recurrent: size =  811.72 MiB (    11 cells,  40 layers, 11 seqs), R (f32):   19.72 MiB, S (f32):  792.00 MiB
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x000000013380a160) const at llama-graph.cpp:1910:44
   1907	llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
   1908	    const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
   1909	
-> 1910	    auto inp_rs   = build_rs_inp_impl     (ctx0, ubatch, mctx_cur->get_recr());
   1911	    auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
   1912	
   1913	    auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132e09b70) const at llama-graph.cpp:1910:44
   1907	llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
   1908	    const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
   1909	
-> 1910	    auto inp_rs   = build_rs_inp_impl     (ctx0, ubatch, mctx_cur->get_recr());
   1911	    auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
   1912	
   1913	    auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132e09b70) const at llama-graph.cpp:1910:44
   1907	llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
   1908	    const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
   1909	
-> 1910	    auto inp_rs   = build_rs_inp_impl     (ctx0, ubatch, mctx_cur->get_recr());
   1911	    auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
   1912	
   1913	    auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
llama_context:      Metal compute buffer size =   256.67 MiB
llama_context:        CPU compute buffer size =    15.05 MiB
llama_context: graph nodes  = 2303
llama_context: graph splits = 3
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000102f04080) const at llama-graph.cpp:1910:44
   1907	llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
   1908	    const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
   1909	
-> 1910	    auto inp_rs   = build_rs_inp_impl     (ctx0, ubatch, mctx_cur->get_recr());
   1911	    auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
   1912	
   1913	    auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
2025-10-10 10:41:04.730362-0600 llama-parallel[35994:106223077] flock failed to lock list file (/var/folders/20/4th8f1dj2t15_21ygkdhskdc0000gn/C//com.apple.metal/16777235_419/functions.list): errno = 35
No new questions so proceed with build-in defaults.


main: initializing samplers with different RNG seeds, starting from -1
main: Simulating parallel requests from clients:
main: n_parallel = 10, n_sequences = 10, cont_batching = 1, system tokens = 256

Processing requests ...

main: clearing the KV cache
Client   0, seq    0, junk =    0, prompt = 267, started decoding ...
Client   1, seq    1, junk =    0, prompt = 267, started decoding ...
Client   2, seq    2, junk =    0, prompt = 267, started decoding ...
Client   3, seq    3, junk =    0, prompt = 270, started decoding ...
Client   4, seq    4, junk =    0, prompt = 273, started decoding ...
Client   5, seq    5, junk =    0, prompt = 267, started decoding ...
Client   6, seq    6, junk =    0, prompt = 273, started decoding ...
Client   7, seq    7, junk =    0, prompt = 273, started decoding ...
Client   8, seq    8, junk =    0, prompt = 273, started decoding ...
Client   9, seq    9, junk =    0, prompt = 270, started decoding ...
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132f0c4b0) const at llama-graph.cpp:1910:44
   1907	llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
   1908	    const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
   1909	
-> 1910	    auto inp_rs   = build_rs_inp_impl     (ctx0, ubatch, mctx_cur->get_recr());
   1911	    auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
   1912	
   1913	    auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132f0c4b0) const at llama-graph.cpp:1910:44
   1907	llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
   1908	    const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
   1909	
-> 1910	    auto inp_rs   = build_rs_inp_impl     (ctx0, ubatch, mctx_cur->get_recr());
   1911	    auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
   1912	
   1913	    auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c:1648: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
(lldb) process attach --pid 35994
error: attach failed: tried to attach to process already being debugged
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000101211550 libggml-base.dylib`ggml_abort(file="/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c", line=1648, fmt="GGML_ASSERT(%s) failed") at ggml.c:233:5
   230 	        ggml_print_backtrace();
   231 	    }
   232 	
-> 233 	    abort();
   234 	}
   235 	
   236 	// ggml_print_backtrace is registered with std::set_terminate by ggml.cpp
Target 0: (llama-parallel) stopped.
```

gabe-l-hart avatar Oct 10 '25 16:10 gabe-l-hart

So the change in 00f115fe810815d4a22a6dee0acc346131e970e1 does not work for some reason. We want to eventually extract the state of the recurrent memory into the memory context, as we do with the KV cache implementations, but I think something is being mutated when it should not be. For now, let's revert this and figure it out later.

To clarify, the design is that when building the graph we should only reference data that is stored in the memory context (i.e. in llama_memory_recurrent_context), not in the memory itself (i.e. in llama_memory_recurrent), except for some constant members such as the ggml tensors.
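
A minimal sketch of that separation (simplified, hypothetical types for illustration, not the actual llama.cpp classes; the get_head/get_rs_z accessors mirror the ones on llama_memory_recurrent_context):

```cpp
#include <cstdint>

// The memory itself: mutable, updated as ubatches are scheduled and applied.
struct recurrent_memory {
    uint32_t head = 0;
    int32_t  rs_z = -1;
    // ... cell bookkeeping, ggml tensors for the recurrent states, etc.
};

// The memory context: a snapshot taken when the ubatch is scheduled.
// Graph building (and can_reuse checks) read only this snapshot and never
// reach back into the live memory object, so the values remain valid even
// after the memory has moved on to other ubatches.
struct recurrent_memory_context {
    explicit recurrent_memory_context(const recurrent_memory & mem)
        : head(mem.head), rs_z(mem.rs_z) {}

    uint32_t get_head() const { return head; }
    int32_t  get_rs_z() const { return rs_z; }

private:
    uint32_t head;
    int32_t  rs_z;
};

int main() {
    recurrent_memory mem;
    mem.head = 3;

    recurrent_memory_context ctx(mem);  // snapshot for this ubatch

    mem.head = 7;                       // the memory mutates afterwards...
    return ctx.get_head() == 3 ? 0 : 1; // ...but the context still sees the snapshot
}
```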

ggerganov avatar Oct 10 '25 16:10 ggerganov

Got it, that makes sense.

gabe-l-hart avatar Oct 10 '25 16:10 gabe-l-hart