llama.cpp
CUDA: compress-mode size
CUDA 12.8 added the option to specify stronger compression for binaries.
I ran some tests in CI with the new CUDA 12.8 Ubuntu Docker image:
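As a sketch of how the mode is selected: nvcc 12.8 introduced a `--compress-mode` flag for fatbin compression (earlier toolkits reject it). The CMake invocation below is illustrative only; it passes the flag through the generic `CMAKE_CUDA_FLAGS` rather than whatever dedicated option the build system may expose.

```shell
# New in nvcc 12.8: choose the fatbin compression mode explicitly.
nvcc --compress-mode=size -arch=sm_89 -c ggml-cuda.cu -o ggml-cuda.o

# Via CMake, the flag can be appended to the CUDA compiler flags.
# (Illustrative; the project may expose its own cache variable for this.)
cmake -B build -DGGML_CUDA=ON \
      -DCMAKE_CUDA_FLAGS="--compress-mode=size"
```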
**Single arch (`89-real`)**

In this scenario it appears that no compression is applied by default at all?
| comp | ggml-cuda.so |
|---|---|
| none | 64M |
| speed (default) | 64M |
| balanced | 64M |
| size | 18M |
**Five arches (`60;61;70;75;80`)**
| comp | ggml-cuda.so |
|---|---|
| none | 994M |
| speed (default) | 448M |
| balanced | 368M |
| size | 127M |
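Not part of the measurements themselves, but for quick reference the five-arch table works out to the following ratios relative to the uncompressed build (a plain POSIX `sh` sketch using the sizes above):

```shell
#!/bin/sh
# Compression ratios implied by the 5-arch table (sizes in MB).
none=994
for pair in speed:448 balanced:368 size:127; do
  mode=${pair%%:*}
  mb=${pair##*:}
  # print each mode's size as a percentage of the uncompressed binary
  awk -v s="$mb" -v n="$none" -v m="$mode" \
    'BEGIN { printf "%s: %d MB (%.1f%% of uncompressed)\n", m, s, 100*s/n }'
done
```

So `size` mode shrinks the five-arch binary to roughly 13% of the uncompressed size.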
I did not test the runtime load overhead this should incur, but for most ggml-cuda use cases the processes are long(er)-lived, so the trade-off seems reasonable to me.