TensorRT-LLM
PromptTuning can not work with block_reuse
Hi, I found that when I use prompt tuning, block_reuse seems not to work.

CUDA version: 12.2, TRT-LLM version: 0.9.0, device: A100, precision: FP16
For Yi-6B with 512 input tokens, 1 output token, and batch size 32:
- Without prompt tuning:
  - block_reuse disabled: 0.99 iter/s
  - block_reuse enabled: 3.00 iter/s
- With prompt tuning:
  - block_reuse disabled: 0.99 iter/s
  - block_reuse enabled: 0.99 iter/s
It seems the two features cannot work simultaneously. Could you please take a look? Thanks!
Yes, this is expected. Prompt tuning cannot currently be combined with block_reuse.
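To see why these features conflict, here is a toy sketch (not TensorRT-LLM code; all names are hypothetical) of prefix-based KV-block reuse. Block reuse keys cached KV blocks by the token IDs in a prompt prefix, while prompt tuning prepends learned virtual-token embeddings selected by a task ID. Since those embeddings change the attention states, blocks would have to be keyed per task, which defeats cross-request sharing:

```python
# Toy illustration of why prefix-based KV-block reuse breaks down with
# prompt tuning. This is a dict-based sketch, not the real engine.
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (toy value)

@dataclass
class KVCachePool:
    # maps (task_id, block of token IDs) -> cached flag
    blocks: dict = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def lookup_or_insert(self, prompt_ids, task_id=None):
        """Walk the prompt block by block, reusing blocks whose key matches.

        `task_id` stands for a prompt-tuning table entry: its virtual
        tokens alter the KV states, so a correct cache must key blocks
        by (task_id, token_ids) and cannot share them across tasks.
        """
        for start in range(0, len(prompt_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
            key = (task_id, tuple(prompt_ids[start:start + BLOCK_SIZE]))
            if key in self.blocks:
                self.hits += 1
            else:
                self.blocks[key] = True
                self.misses += 1

pool = KVCachePool()
prompt = list(range(16))  # 16-token prompt = 4 blocks

# Plain requests: an identical prompt reuses every block the second time.
pool.lookup_or_insert(prompt)             # 4 misses (cold cache)
pool.lookup_or_insert(prompt)             # 4 hits
print(pool.hits, pool.misses)             # → 4 4

# Prompt-tuned requests: each task partitions the cache, so even an
# identical token prefix cannot be reused across tasks.
pool.lookup_or_insert(prompt, task_id=1)  # 4 misses
pool.lookup_or_insert(prompt, task_id=2)  # 4 misses again
print(pool.hits, pool.misses)             # → 4 12
```

In the sketch, the hit rate with prompt tuning stays at zero across tasks, which mirrors the benchmark above where enabling block_reuse brings no speedup once prompt tuning is active.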