TensorRT-LLM
[Feature Request] RelayAttention for efficient inference with a long shared prefix
See the preprint here.
It would be useful for few-shot in-context learning, models fine-tuned with prefix-tuning, and, more generally, LLM applications where many requests share a long common prefix.
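As a toy illustration of why a shared prefix can be handled separately, the sketch below shows the standard softmax-renormalization trick that this kind of approach builds on: attention over the full sequence can be computed as two partial attentions (one over the shared prefix, one over the request-specific suffix) and merged exactly using their log-sum-exp statistics. This is a minimal NumPy sketch, not TensorRT-LLM code; all function names here are illustrative.

```python
import numpy as np

def partial_attn(q, K, V):
    """Attention of a single query over (K, V); also returns the log-sum-exp
    of the scores so partial results can be merged exactly later."""
    s = K @ q / np.sqrt(q.shape[0])
    m = s.max()
    e = np.exp(s - m)
    lse = m + np.log(e.sum())
    return (e / e.sum()) @ V, lse

def merge(o_pre, lse_pre, o_suf, lse_suf):
    """Combine prefix and suffix partial attentions by renormalizing
    each with its share of the total softmax mass."""
    total = np.logaddexp(lse_pre, lse_suf)
    return np.exp(lse_pre - total) * o_pre + np.exp(lse_suf - total) * o_suf

rng = np.random.default_rng(0)
d, n_pre, n_suf = 8, 16, 4
q = rng.standard_normal(d)
K = rng.standard_normal((n_pre + n_suf, d))
V = rng.standard_normal((n_pre + n_suf, d))

full, _ = partial_attn(q, K, V)                       # attention over everything
o_pre, l_pre = partial_attn(q, K[:n_pre], V[:n_pre])  # shared prefix only
o_suf, l_suf = partial_attn(q, K[n_pre:], V[n_pre:])  # per-request suffix only
merged = merge(o_pre, l_pre, o_suf, l_suf)
assert np.allclose(full, merged)  # exact, not an approximation
```

The efficiency argument is that the prefix-side computation is identical across all requests in a batch, so it only needs the prefix KV data in memory once rather than per request.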