Does trtllm-serve enable prefix caching automatically with DeepSeek-R1?

Bihan opened this issue 9 months ago

Does trtllm-serve enable prefix caching automatically?

I want to serve DeepSeek-R1 with prefix caching enabled. I am deploying as follows:

```bash
trtllm-serve \
    --backend pytorch \
    --max_batch_size $MAX_BATCH_SIZE \
    --max_num_tokens $MAX_NUM_TOKENS \
    --max_seq_len $MAX_SEQ_LENGTH \
    --tp_size 8 \
    --ep_size 4 \
    --pp_size 1 \
    deepseek
```

Bihan · Mar 17 '25, 16:03
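
(Editor's note: for later readers, a minimal sketch of how prefix caching is typically switched on once a model supports it. It assumes trtllm-serve's generic `--extra_llm_api_options` mechanism and the LLM API's `kv_cache_config.enable_block_reuse` knob; neither was confirmed to work for DeepSeek-R1 at the time of this thread.)

```yaml
# extra-config.yml -- assumed override file for trtllm-serve.
# enable_block_reuse is the LLM API's name for prefix caching
# (reusing KV cache blocks across requests that share a prefix).
kv_cache_config:
  enable_block_reuse: true
```

The launch command above would then gain `--extra_llm_api_options extra-config.yml`.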

@Bihan Hi, prefix caching (KV cache reuse) is still being developed by our engineering team. I expect it to land in the main branch in the upcoming weeks. After KV cache reuse is enabled for DS R1, we will expose it in trtllm-serve. Please wait for updates.

cc @zhhuang-nv for visibility on the DS R1 KV cache reuse topic, and @LinPoly @kaiyux for the trtllm-serve topic.

Thanks,
June

juney-nvidia · Mar 24 '25, 23:03

Hi @juney-nvidia, just wanted to check in on whether there has been any progress on prefix caching for the DS R1 model, or in trtllm-serve generally. Thanks!

LuciusMos · Apr 29 '25, 23:04

https://github.com/NVIDIA/TensorRT-LLM/pull/3571 We have a PR for this feature. Some CI tests are failing and I am fixing them, thanks.

zhhuang-nv · Apr 30 '25, 02:04
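
(Editor's note: once the PR above lands, enabling KV cache block reuse from the Python LLM API would presumably look like the sketch below. Class and parameter names follow the existing LLM API; the model path and DS R1 support are assumptions, not confirmed by this thread.)

```python
# Sketch only: KvCacheConfig(enable_block_reuse=True) is the
# existing LLM API knob for prefix caching; whether DeepSeek-R1
# honors it depends on the PR above landing.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # illustrative model path
    tensor_parallel_size=8,
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
)

# Requests sharing a long system prompt should then hit cached
# KV blocks for the shared prefix instead of recomputing them.
out = llm.generate("Hello, world")
print(out.outputs[0].text)
```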

@zhhuang-nv Thanks for the update! This feature is really critical in my scenario :)

LuciusMos · Apr 30 '25, 18:04