Issues opened by Jiarong Xing (4 results)
It would be great to support Ollama with kvcached for local deployment of multiple LLMs.
When GPU memory is almost full, kvcached could support offloading the KV cache to CPU memory or even disk. Should this be done with CUDA UVM or with more application-level semantics?
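A minimal sketch of the application-level option, for discussion only: cold KV-cache blocks are explicitly copied to pinned host memory under GPU memory pressure and copied back when the request is scheduled again. The block shape, helper names, and swap policy below are illustrative assumptions, not kvcached's actual internals; the alternative would be to let CUDA UVM migrate pages transparently.

```python
import torch

# Illustrative KV-cache block layout: [K/V, layers, tokens, heads, head_dim]
NUM_LAYERS, BLOCK_TOKENS, NUM_HEADS, HEAD_DIM = 32, 16, 8, 128

def new_kv_block(device: str) -> torch.Tensor:
    # Allocate one KV-cache block; pin host memory so H2D/D2H copies are async.
    return torch.empty(2, NUM_LAYERS, BLOCK_TOKENS, NUM_HEADS, HEAD_DIM,
                       dtype=torch.float16, device=device,
                       pin_memory=(device == "cpu"))

def offload_block(gpu_block: torch.Tensor) -> torch.Tensor:
    # Spill a cold block to pinned CPU memory when GPU memory runs low.
    cpu_block = new_kv_block("cpu")
    cpu_block.copy_(gpu_block, non_blocking=True)
    torch.cuda.synchronize()  # make sure the copy finished before dropping the GPU copy
    return cpu_block

def reload_block(cpu_block: torch.Tensor) -> torch.Tensor:
    # Bring the block back to the GPU before the request becomes active again.
    return cpu_block.to("cuda", non_blocking=True)

if __name__ == "__main__":
    block = new_kv_block("cuda")
    block = offload_block(block)   # memory pressure: spill to host
    block = reload_block(block)    # request resumes: fetch back
    print(block.device)
```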
Can we add an example to demonstrate kvcached with the vLLM Semantic Router? https://vllm-semantic-router.com/ We can run multiple models on one GPU for the router to choose from, including the sleep and...
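A rough sketch of the setup this asks for, assuming two OpenAI-compatible vLLM servers are already running on the same GPU (e.g. both launched with kvcached enabled so their KV caches share GPU memory). The ports, model names, and keyword-based routing below are placeholder assumptions; the real vLLM Semantic Router classifies requests semantically rather than by keywords.

```python
import requests

# Two hypothetical backends sharing one GPU; URLs and model names are assumptions.
BACKENDS = {
    "code": {"url": "http://localhost:8001/v1/chat/completions",
             "model": "Qwen2.5-Coder-7B-Instruct"},
    "chat": {"url": "http://localhost:8002/v1/chat/completions",
             "model": "Llama-3.1-8B-Instruct"},
}

def route(prompt: str) -> dict:
    # Stand-in for semantic classification: pick the coding model if the
    # prompt looks code-related, otherwise the general chat model.
    key = "code" if any(w in prompt.lower() for w in ("code", "python", "bug")) else "chat"
    return BACKENDS[key]

def ask(prompt: str) -> str:
    # Forward the request to the chosen backend using the OpenAI chat format.
    backend = route(prompt)
    resp = requests.post(backend["url"], json={
        "model": backend["model"],
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Write a Python function that reverses a string."))
```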