Does trtllm-serve enable prefix caching automatically with DeepSeek-R1?
Does trtllm-serve enable prefix caching automatically?

I want to serve DeepSeek-R1 with prefix caching enabled. I am deploying as follows:
```bash
trtllm-serve \
    --backend pytorch \
    --max_batch_size $MAX_BATCH_SIZE \
    --max_num_tokens $MAX_NUM_TOKENS \
    --max_seq_len $MAX_SEQ_LENGTH \
    --tp_size 8 \
    --ep_size 4 \
    --pp_size 1 \
    deepseek
```
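Is an options file the intended way to turn it on? Here is what I would have guessed — a sketch only, since I am not sure the `--extra_llm_api_options` flag and the `kv_cache_config.enable_block_reuse` key apply to the PyTorch backend for this model:

```bash
# Sketch only: the flag and YAML keys below are assumptions, not verified
# to work for DeepSeek-R1 on the pytorch backend.
cat > extra_llm_api_options.yaml <<'EOF'
kv_cache_config:
  enable_block_reuse: true   # request KV cache (prefix) reuse across requests
EOF

trtllm-serve \
    --backend pytorch \
    --max_batch_size $MAX_BATCH_SIZE \
    --max_num_tokens $MAX_NUM_TOKENS \
    --max_seq_len $MAX_SEQ_LENGTH \
    --tp_size 8 \
    --ep_size 4 \
    --pp_size 1 \
    --extra_llm_api_options extra_llm_api_options.yaml \
    deepseek
```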
@Bihan Hi, prefix caching (KV cache reuse) is still being developed by our engineering team. I would expect it to land in the main branch in the upcoming weeks. After KV cache reuse is enabled for DS R1, we will expose it in trtllm-serve. Please wait for the updates.
@zhhuang-nv for visibility on the DS R1 KV cache reuse topic. @LinPoly @kaiyux for the trtllm-serve topic.
Thanks, June
Hi @juney-nvidia, just wanted to check in on whether there has been any progress on prefix caching for the DS R1 model, or in trtllm-serve more generally. Thanks!
We have a PR for this feature: https://github.com/NVIDIA/TensorRT-LLM/pull/3571. Some CI tests failed and I am fixing them, thanks.
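Once it lands, one way to sanity-check that prefix reuse is active is to send the same long shared prefix twice and compare latency; the second request should complete noticeably faster. A sketch, assuming trtllm-serve's default OpenAI-compatible endpoint on port 8000 and a served model name of `deepseek` (both assumptions, adjust to your deployment):

```bash
# Hypothetical smoke test: an identical long prefix sent twice; with block
# reuse enabled, the second call should complete faster since the shared
# prefix's KV cache blocks are reused instead of recomputed.
PREFIX="$(printf 'Repeat this context. %.0s' $(seq 1 200))"
for q in "What is 2+2?" "What is 3+3?"; do
  time curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"deepseek\",
         \"messages\": [{\"role\": \"user\", \"content\": \"$PREFIX $q\"}],
         \"max_tokens\": 16}"
done
```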
@zhhuang-nv Thanks for the update! This feature is really critical in my scenario :)