
PromptTuning cannot work with block_reuse

Open littletomatodonkey opened this issue 1 year ago • 1 comment

Hi, I found that when I use prompt tuning, block_reuse does not seem to work.

Environment:

  • CUDA version: 12.2
  • TRT-LLM version: 0.9.0
  • Device: A100
  • Precision: FP16

Benchmarks for Yi-6B with 512 input tokens, 1 output token, and batch size 32:

  • For model without prompt tuning
    • disable block_reuse: 0.99 iter/s
    • enable block_reuse: 3.00 iter/s
  • For model with prompt tuning
    • disable block_reuse: 0.99 iter/s
    • enable block_reuse: 0.99 iter/s

It seems the two features cannot work simultaneously. Could you please take a look? Thanks!
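
For reference, here is roughly how I enable the two features. This is a minimal sketch assuming the executor API from the TensorRT-LLM Python bindings; the exact class and argument names (`KvCacheConfig`, `enable_block_reuse`, `PromptTuningConfig`, `max_new_tokens`) may differ between releases, and the engine path and embedding table below are placeholders.

```python
# Minimal sketch, assuming the TensorRT-LLM executor Python bindings;
# class and argument names may differ across releases.
import torch
import tensorrt_llm.bindings.executor as trtllm

# KV-cache block reuse: the engine must also be built with paged context
# FMHA (trtllm-build --use_paged_context_fmha enable) for reuse to apply.
kv_cache_config = trtllm.KvCacheConfig(enable_block_reuse=True)
executor_config = trtllm.ExecutorConfig(kv_cache_config=kv_cache_config)

executor = trtllm.Executor(
    "/path/to/yi-6b/engine_dir",      # placeholder engine path
    trtllm.ModelType.DECODER_ONLY,
    executor_config,
)

# Prompt tuning: the engine must be built with
# --max_prompt_embedding_table_size > 0; the table here is a dummy.
embedding_table = torch.rand(16, 4096, dtype=torch.float16)  # 4096 = Yi-6B hidden size
input_ids = list(range(512))          # stand-in for 512 real input tokens

request = trtllm.Request(
    input_token_ids=input_ids,
    max_new_tokens=1,
    prompt_tuning_config=trtllm.PromptTuningConfig(embedding_table),
)
request_id = executor.enqueue_request(request)
```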

littletomatodonkey · Jul 06 '24 12:07

Yes, it's expected. Prompt tuning cannot work with block_reuse at the moment.
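
A plausible intuition for why (this is an assumption, not stated in this thread): block reuse identifies reusable KV blocks by matching input token IDs, while prompt tuning injects virtual-token embeddings that are invisible to that key, so two requests with identical token IDs but different prompt tables would wrongly share cached blocks. A toy sketch of the hazard, not actual TRT-LLM code:

```python
# Toy illustration (assumption, not TRT-LLM code): block reuse keys cached
# KV blocks on the input token IDs alone, which cannot tell two different
# prompt-tuning tables apart.
kv_cache = {}

def block_key(token_ids):
    # Reuse key derived only from token IDs.
    return tuple(token_ids)

# Two requests share token IDs but use different prompt-tuning tables.
req_a = {"token_ids": [101, 102, 103], "prompt_table": "task_A"}
req_b = {"token_ids": [101, 102, 103], "prompt_table": "task_B"}

kv_cache[block_key(req_a["token_ids"])] = "KV state computed under task_A"

# req_b gets a cache hit even though its embeddings differ, which would
# return incorrect KV state; hence reuse is disabled with prompt tuning.
assert block_key(req_b["token_ids"]) in kv_cache
```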

QiJune · Jul 15 '24 08:07