llama.cpp CUDA: compress-mode size

CUDA: compress-mode size

Open Green-Sky opened this issue 19 hours ago • 0 comments

cuda 12.8 added the option to specify stronger compression for binaries.

I ran some tests in CI with the new ubuntu 12.8 docker image:

`89-real` arch

In this scenario, it appears it is not compressing by default at all?

comp	ggml-cuda.so
none	64M
speed (default)	64M
balanced	64M
size	18M

`60;61;70;75;80` arches

comp	ggml-cuda.so
none	994M
speed (default)	448M
balanced	368M
size	127M

I did not test the runtime load overhead this should incur. But for most ggml-cuda usecases, the processes are usually long(er) lived, so the trade-off seems reasonable to me.

Feb 22 '25 18:02 Green-Sky

llama.cpp llama.cpp copied to clipboard

CUDA: compress-mode size

89-real arch

60;61;70;75;80 arches

llama.cpp
llama.cpp copied to clipboard

`89-real` arch

`60;61;70;75;80` arches