ik_llama.cpp
Slightly better graph split strategy
This change seems to give slightly better token generation (TG) performance with split mode "graph" and tensor overrides. For TG, it simply removes the forced graph split when combining partial shared expert results.
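To illustrate the idea, here is a minimal toy sketch, not the actual ik_llama.cpp code. All names in it (`Node`, `combines_shared_experts`, `count_splits`) are hypothetical, and it assumes that a single-token batch stands in for TG. It only shows how skipping the forced split for TG reduces the number of graph splits.

```cpp
#include <string>
#include <vector>

// Hypothetical toy model; none of these names come from ik_llama.cpp.
struct Node {
    std::string name;
    bool combines_shared_experts; // node that sums partial shared expert results
};

// Count the splits a toy scheduler would create for a graph.
// Assumption: a batch of exactly one token represents token generation (TG).
int count_splits(const std::vector<Node>& graph, int n_tokens) {
    const bool is_tg = (n_tokens == 1);
    int splits = 1; // the toy graph always starts with one split
    for (const Node& node : graph) {
        // Force a split at the combine node only when not doing TG.
        if (node.combines_shared_experts && !is_tg) {
            ++splits;
        }
    }
    return splits;
}
```

With a toy graph containing one combine node, `count_splits(graph, 4096)` yields one more split than `count_splits(graph, 1)`, which is the reduction this change aims for in the TG case.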
Here is an example of running a 5.5 Thireus quantization of GLM-4.6 on a 2x3090 system with a Ryzen-3995WX CPU. The command line was:
./bin/llama-sweep-bench -m $model -t 64 -ngl 100 -sm graph -b 4096 -ub 4096 -n 64 -gr -c 65536 -ctk q8_0 -ctv q8_0