[Feature]: Custom attention masks
Inspired by this paper, we're exploring ways to bootstrap a bidirectional-context LLM from a decoder-only causal LLM (e.g. Llama-3). This is straightforward in Hugging Face transformers by passing a custom attention mask.
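For context, here is a minimal sketch of the transformers-side trick being referred to: build an additive 4D attention mask that is bidirectional over the prompt and causal elsewhere, then pass it to a decoder-only model to get bidirectional prompt hidden states. This assumes a recent transformers version that accepts a custom 4D `attention_mask` for Llama-style models; the model id and helper name are illustrative, and none of this is vLLM code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prompt_bidirectional_mask(prompt_len: int, total_len: int, dtype: torch.dtype) -> torch.Tensor:
    """Additive mask: bidirectional over the first `prompt_len` tokens, causal after."""
    allowed = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))
    allowed[:prompt_len, :prompt_len] = True  # prompt tokens attend to the whole prompt
    mask = torch.zeros(total_len, total_len, dtype=dtype)
    mask.masked_fill_(~allowed, torch.finfo(dtype).min)  # large negative = masked out
    return mask[None, None]  # shape (1, 1, total_len, total_len)

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative; any Llama-style causal LM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

ids = tok("A prompt to encode with bidirectional context", return_tensors="pt").input_ids
# Here the whole input is the prompt, so the mask is fully bidirectional;
# tokens appended later (decoding) would fall into the causal region.
mask = prompt_bidirectional_mask(ids.shape[1], ids.shape[1], model.dtype)

with torch.no_grad():
    out = model(input_ids=ids, attention_mask=mask, output_hidden_states=True)
prompt_hidden = out.hidden_states[-1]  # bidirectionally contextualized prompt states
```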
Looking for guidance on how to make this happen in vLLM. TL;DR:
- Compute bidirectional hidden states from prompt.
- Use causal attention for decoding.

Help appreciated!