
[Feature Request] RelayAttention for efficient inference with a long shared prefix

Open · rayleizhu opened this issue 1 year ago · 1 comment

See the preprint here.

It would be useful for few-shot in-context learning, models finetuned with prefix-tuning, and more generally any LLM application that serves many requests sharing a long common prefix.
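As I understand the preprint, the core idea is to compute attention over the shared prefix once for the whole batch and over each request's distinct suffix separately, then fuse the two partial results with a softmax-style re-normalization over their log-sum-exp values. Below is a minimal sketch of just that fusion step; the function name, tensor shapes, and the assumption that partial outputs plus their log-sum-exps are available are mine, not the paper's or TensorRT-LLM's API:

```python
import torch

def fuse_partial_attention(o_prefix, lse_prefix, o_suffix, lse_suffix):
    """Hypothetical fusion of two partial attention results.

    o_*:   [batch, heads, dim]  partial attention outputs over prefix / suffix keys
    lse_*: [batch, heads]       log-sum-exp of the attention logits for each part
    """
    # Softmax over the two log-sum-exp values yields the correct mixing weights,
    # so the fused output equals attention over the concatenated (prefix + suffix) keys.
    lse = torch.stack([lse_prefix, lse_suffix], dim=-1)   # [b, h, 2]
    w = torch.softmax(lse, dim=-1).unsqueeze(-1)          # [b, h, 2, 1]
    o = torch.stack([o_prefix, o_suffix], dim=-2)         # [b, h, 2, d]
    return (w * o).sum(dim=-2)                            # [b, h, d]
```

The point of the split is that the prefix part can be computed as one dense batched matmul instead of per-request gathered KV-cache reads, which is where the claimed savings come from.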

rayleizhu · Feb 23 '24