Slightly better graph split strategy

Open ikawrakow opened this issue 3 weeks ago • 0 comments

This change seems to give slightly better token generation (TG) performance with split mode "graph" and tensor overrides. The change itself is small: for TG, the forced graph split when combining partial shared expert results is removed.

Here is an example of running a 5.5 Thireus quantization of GLM-4.6 on a 2x3090 system with a Ryzen-3995WX CPU. The command line was:

./bin/llama-sweep-bench -m $model -t 64 -ngl 100 -sm graph -b 4096 -ub 4096 -n 64 -gr -c 65536 -ctk q8_0 -ctv q8_0

Dec 02 '25 08:12 ikawrakow