Yifan Qiao
A comment on Reddit mentioned that Ollama and LM Studio are actually based on llama.cpp with minor modifications. So maybe we can look into the delta there and hopefully the...
Thanks for the issue. I edited the log format a little bit for better presentation. This could be related to FP8 data type. Please use BF16 or FP16 while we...
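For illustration, a minimal sketch of switching off FP8 as a workaround, assuming the engine is vLLM (the model name is a placeholder):

```python
from vllm import LLM

# Force BF16 weights/activations instead of FP8 while the FP8 path is investigated.
# "float16" works the same way if FP16 is preferred.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")
```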
Thanks for digging into the code! We totally agree that quantization is a must. We'd love to collaborate if you are interested in helping with the integration. Please feel free...
Thanks @alecngo! I would suggest using a different IPC_NAME for each engine. `kvcached_mem_info` is the default IPC name and can cause conflicts when shared. This could be because kvcached assumes co-running engines must...
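Roughly what that would look like, as a sketch only: the environment variable name `KVCACHED_IPC_NAME` below is an assumption for illustration, not confirmed kvcached API.

```python
import os

# Hypothetical: give each co-running engine its own IPC name before engine startup,
# so they don't collide on the default "kvcached_mem_info" shared-memory segment.
os.environ["KVCACHED_IPC_NAME"] = "kvcached_mem_info_engine0"  # engine 1 would use "..._engine1"
```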
Oh, I think our version detection does not cover 0.8.5.post1, but that can be fixed quickly. Will update shortly.
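For reference, a small sketch of how post-release tags behave under PEP 440 parsing with the `packaging` library (not kvcached's actual detection code), which is one way to make range checks robust to suffixes like `.post1`:

```python
from packaging.version import Version

# Post-releases like "0.8.5.post1" compare as >= the base "0.8.5",
# so a range check keeps working even when exact string matching would not.
installed = Version("0.8.5.post1")
print(installed >= Version("0.8.5"))  # True
print(installed.base_version)         # "0.8.5" -- handy for lookups keyed by base version
```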
Thank you @inforly for the issue! This should be fixed by #194. We'd appreciate it if you could give it another try. The fix is currently on the main branch,...
Great to see it running! I assume you set a static gpu_mem_utilization (e.g., 0.4 for each model) when disabling kvcached. If so, similar performance is expected, because the main goal...
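A minimal sketch of that static-partitioning baseline, assuming vLLM (model names are placeholders): each co-located model gets a fixed slice of GPU memory up front, so neither can borrow the other's idle KV-cache space.

```python
from vllm import LLM

# Static split: 0.4 + 0.4 of GPU memory, reserved regardless of actual load.
model_a = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.4)
model_b = LLM(model="Qwen/Qwen2.5-7B-Instruct", gpu_memory_utilization=0.4)
```

With kvcached enabled, that reservation can instead grow and shrink with demand, which matters under bursty or imbalanced traffic rather than in a steady-state benchmark.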
Thank you @robertgshaw2-redhat! Have followed up over email and look forward to exploring potential integration together.
Thanks for the issue! It seems kvcached initially saw enough available memory but then found it insufficient during the actual allocation. To help debug, could you let us...
Replied in #197. I agree they are the same issue and I still highly suspect this is a race condition. Will follow up there.