Jiarong Xing

4 issues opened by Jiarong Xing

It would be great to support Ollama with kvcached for local deployment of multiple LLMs.
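
A minimal client-side sketch of the scenario this asks for, assuming a local Ollama server at its default http://localhost:11434 and two locally pulled model tags chosen only as examples; the kvcached integration itself is the hypothetical part and would live inside the serving backend rather than in this client code:

```python
import requests

def ask(model: str, prompt: str) -> str:
    """Query one locally served Ollama model via its HTTP generate endpoint."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Two models pulled locally (e.g. `ollama pull llama3.2`, `ollama pull qwen2.5`);
# the point of kvcached support would be letting them share GPU memory
# elastically instead of each reserving its own fixed KV-cache region.
print(ask("llama3.2", "Summarize what a KV cache is in one sentence."))
print(ask("qwen2.5", "Summarize what a KV cache is in one sentence."))
```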

When GPU memory is nearly full, kvcached could support offloading the KV cache to CPU memory or even to disk. Should this be done with CUDA UVM, or with more application-level semantics?
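
A minimal sketch of the CUDA UVM option mentioned above, using CuPy's managed-memory allocator purely for illustration (kvcached's actual allocator may work differently). With cudaMallocManaged-backed memory, the driver pages data between GPU and CPU on demand, so KV-cache-like buffers can exceed free GPU memory instead of failing with OOM:

```python
import cupy as cp

# Route all CuPy allocations through cudaMallocManaged (CUDA UVM).
cp.cuda.set_allocator(cp.cuda.malloc_managed)

# Illustrative KV-cache-like buffer (~2 GiB of fp16); with UVM the driver can
# spill cold pages to host memory when the GPU is nearly full.
kv_block = cp.zeros((4096, 4096, 64), dtype=cp.float16)
kv_block += 1  # touching the pages migrates them to the GPU on demand
print(f"{kv_block.nbytes / 2**30:.1f} GiB of managed memory")
```

The trade-off behind the question: UVM is transparent but migrates at page granularity under driver control, while application-level offloading lets the serving engine decide which KV blocks to evict and when.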

Can we add an example to demonstrate kvcached with the vLLM Semantic Router? https://vllm-semantic-router.com/ We could run multiple models on one GPU for the router to choose from, including the sleep and...
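
Not the full router demo, but a rough sketch of the sleep/wake mechanism this refers to, using vLLM's Python API; the model names, memory split, and the stand-in routing function are placeholders, and whether two engines plus the semantic router coexist cleanly on one GPU is exactly what the proposed example would need to show:

```python
from vllm import LLM, SamplingParams

# Two models share one GPU; each engine is created with sleep mode enabled.
llm_a = LLM(model="Qwen/Qwen2.5-1.5B-Instruct",
            enable_sleep_mode=True, gpu_memory_utilization=0.45)
llm_b = LLM(model="meta-llama/Llama-3.2-1B-Instruct",
            enable_sleep_mode=True, gpu_memory_utilization=0.45)

def route(prompt: str) -> LLM:
    # Stand-in for the semantic router's decision; the real router picks a
    # model based on the prompt's intent/category.
    return llm_a if "code" in prompt.lower() else llm_b

prompt = "Explain KV cache paging in one sentence."
chosen = route(prompt)
idle = llm_b if chosen is llm_a else llm_a

idle.sleep(level=1)  # release the idle model's GPU memory
out = chosen.generate([prompt], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
idle.wake_up()       # restore the sleeping model when the router needs it
```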