KV cache placement question when running deepseek-r1-dynamic-1.58-bit
I have 8 × 4090s. I want to put the model layers on 7 of the cards and reserve the remaining card for the KV cache. How should I set this up? Right now the KV cache sits in CPU memory and inference is very slow.
I'm no expert, but I think adjusting the tensor_split setting should fix it. It seems you should be able to split the model tensors across 7 cards and push the remaining tensors and the KV cache onto the 8th.
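As a rough sketch of that idea, something like the following llama.cpp invocation might work. The flag names (`-ngl`, `--tensor-split`, `--main-gpu`) are real llama.cpp options, but the model filename is a placeholder, and note that llama.cpp normally allocates each layer's KV cache on the same GPU as that layer, so a zero split weight on card 7 frees its VRAM rather than pinning the KV cache there by itself:

```shell
# Sketch, not a verified recipe -- behavior varies by llama.cpp version.
# -ngl 999             : offload all layers to GPU instead of CPU
# --tensor-split ...   : relative weight split across the 8 GPUs;
#                        a 0 for the last card keeps its VRAM mostly free
# --main-gpu 7         : place non-split tensors and scratch buffers on card 7
./llama-cli \
  -m DeepSeek-R1-UD-IQ1_S.gguf \
  -ngl 999 \
  --tensor-split 1,1,1,1,1,1,1,0 \
  --main-gpu 7 \
  -p "Hello"
```

It would be worth experimenting with the split ratios and checking `nvidia-smi` to see where the KV cache actually lands.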
Similar question. Is there any guide on this?
Can the latest llama.cpp now support running the deepseek-r1-dynamic-1.58-bit model, assuming sufficient hardware memory?
This issue was closed because it has been inactive for 14 days since being marked as stale.