
CUDA: compress-mode size

Green-Sky opened this issue 19 hours ago · 0 comments

CUDA 12.8 added the `--compress-mode` option to specify stronger compression for device binaries.

I ran some tests in CI with the new Ubuntu CUDA 12.8 Docker image:

89-real arch

In this scenario, it appears no compression is applied by default at all?

| `--compress-mode`  | `ggml-cuda.so` size |
|--------------------|---------------------|
| none               | 64M                 |
| speed (default)    | 64M                 |
| balanced           | 64M                 |
| size               | 18M                 |

60;61;70;75;80 arches

| `--compress-mode`  | `ggml-cuda.so` size |
|--------------------|---------------------|
| none               | 994M                |
| speed (default)    | 448M                |
| balanced           | 368M                |
| size               | 127M                |

I did not measure the runtime load overhead this should incur, but for most ggml-cuda use cases the processes are long(er)-lived, so the trade-off seems reasonable to me.
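For reference, here is a sketch of how the `size` mode from the tables above could be enabled when building llama.cpp with CMake, assuming the flag is forwarded to nvcc via `CMAKE_CUDA_FLAGS` (the `--compress-mode` option requires CUDA 12.8+; the architecture list matches the second test scenario):

```shell
# Configure llama.cpp with CUDA and maximum fatbin compression.
# --compress-mode accepts: none, speed (default), balanced, size.
cmake -B build \
      -DGGML_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES="60;61;70;75;80" \
      -DCMAKE_CUDA_FLAGS="--compress-mode=size"

# Build in Release mode.
cmake --build build --config Release
```

With an older CUDA toolkit, nvcc rejects the unknown `--compress-mode` flag, so the option should only be added conditionally if the build needs to support pre-12.8 toolchains.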

Green-Sky · Feb 22 '25 18:02