Does trtllm-serve enable prefix caching automatically with DeepSeek-R1?

Bihan opened this issue 9 months ago

Does trtllm-serve enable prefix caching automatically?

I want to serve DeepSeek-R1 with prefix caching enabled. I am deploying as follows:

```bash
trtllm-serve \
    --backend pytorch \
    --max_batch_size $MAX_BATCH_SIZE \
    --max_num_tokens $MAX_NUM_TOKENS \
    --max_seq_len $MAX_SEQ_LENGTH \
    --tp_size 8 \
    --ep_size 4 \
    --pp_size 1 \
    deepseek
```

Bihan · Mar 17 '25, 16:03
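
(Editor's note: for later readers, a minimal sketch of how prefix caching is typically switched on once a model supports it. It assumes trtllm-serve's generic `--extra_llm_api_options` mechanism and the LLM API's `kv_cache_config.enable_block_reuse` knob; neither was confirmed to work for DeepSeek-R1 at the time of this thread.)

```yaml
# extra-config.yml -- assumed override file for trtllm-serve.
# enable_block_reuse is the LLM API's name for prefix caching
# (reusing KV cache blocks across requests that share a prefix).
kv_cache_config:
  enable_block_reuse: true
```

The launch command above would then gain `--extra_llm_api_options extra-config.yml`.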

@Bihan Hi, prefix caching (KV cache reuse) is still being developed by our engineering team. I expect it to land in the main branch in the upcoming weeks. After KV cache reuse is enabled for DS R1, we will expose it in trtllm-serve. Please wait for updates.

cc @zhhuang-nv for visibility on the DS R1 KV cache reuse topic, and @LinPoly @kaiyux for the trtllm-serve topic.

Thanks,
June

juney-nvidia · Mar 24 '25, 23:03

Hi @juney-nvidia, just wanted to check in on whether there has been any progress on prefix caching for the DS R1 model, or in trtllm-serve generally. Thanks!

LuciusMos · Apr 29 '25, 23:04

https://github.com/NVIDIA/TensorRT-LLM/pull/3571 We have a PR for this feature. Some CI tests are failing and I am fixing them, thanks.

zhhuang-nv · Apr 30 '25, 02:04
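
(Editor's note: once the PR above lands, enabling KV cache block reuse from the Python LLM API would presumably look like the sketch below. Class and parameter names follow the existing LLM API; the model path and DS R1 support are assumptions, not confirmed by this thread.)

```python
# Sketch only: KvCacheConfig(enable_block_reuse=True) is the
# existing LLM API knob for prefix caching; whether DeepSeek-R1
# honors it depends on the PR above landing.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # illustrative model path
    tensor_parallel_size=8,
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
)

# Requests sharing a long system prompt should then hit cached
# KV blocks for the shared prefix instead of recomputing them.
out = llm.generate("Hello, world")
print(out.outputs[0].text)
```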

@zhhuang-nv Thanks for the update! This feature is really critical in my scenario :)

LuciusMos · Apr 30 '25, 18:04