TensorRT-LLM
[Feature Request] RelayAttention for efficient inference with a long shared prefix
See the preprint here.
It would be useful for few-shot in-context learning, models fine-tuned with prefix-tuning, and, more generally, LLM applications where many requests share a long common prefix.
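As a toy illustration of why a shared prefix can be handled separately, the sketch below shows the standard softmax-renormalization trick that this kind of approach builds on: attention over the full sequence can be computed as two partial attentions (one over the shared prefix, one over the request-specific suffix) and merged exactly using their log-sum-exp statistics. This is a minimal NumPy sketch, not TensorRT-LLM code; all function names here are illustrative.

```python
import numpy as np

def partial_attn(q, K, V):
    """Attention of a single query over (K, V); also returns the log-sum-exp
    of the scores so partial results can be merged exactly later."""
    s = K @ q / np.sqrt(q.shape[0])
    m = s.max()
    e = np.exp(s - m)
    lse = m + np.log(e.sum())
    return (e / e.sum()) @ V, lse

def merge(o_pre, lse_pre, o_suf, lse_suf):
    """Combine prefix and suffix partial attentions by renormalizing
    each with its share of the total softmax mass."""
    total = np.logaddexp(lse_pre, lse_suf)
    return np.exp(lse_pre - total) * o_pre + np.exp(lse_suf - total) * o_suf

rng = np.random.default_rng(0)
d, n_pre, n_suf = 8, 16, 4
q = rng.standard_normal(d)
K = rng.standard_normal((n_pre + n_suf, d))
V = rng.standard_normal((n_pre + n_suf, d))

full, _ = partial_attn(q, K, V)                       # attention over everything
o_pre, l_pre = partial_attn(q, K[:n_pre], V[:n_pre])  # shared prefix only
o_suf, l_suf = partial_attn(q, K[n_pre:], V[n_pre:])  # per-request suffix only
merged = merge(o_pre, l_pre, o_suf, l_suf)
assert np.allclose(full, merged)  # exact, not an approximation
```

The efficiency argument is that the prefix-side computation is identical across all requests in a batch, so it only needs the prefix KV data in memory once rather than per request.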