Add Encoder-decoder model support and T5 Model support
Add support for encoder-decoder models, with T5 as an example. There are two main differences between encoder-decoder and decoder-only models:
- The KV cache blocks of the prompt and the generated tokens are separated.
- The generated tokens always start with `<decoder_start_token_id>`, and this token is added during the prompt phase (see the sketch below).
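For context on the second point, here is a minimal sketch of the `decoder_start_token_id` behavior, using Hugging Face transformers rather than this PR's code (t5-small is used purely for illustration):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# For T5, decoder_start_token_id is the pad token (id 0).
print(model.config.decoder_start_token_id)  # 0

inputs = tokenizer("translate English to German: Hello", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)

# generate() seeds the decoder with decoder_start_token_id, so every
# generated sequence begins with it.
assert out[0, 0].item() == model.config.decoder_start_token_id
```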
T5 has a custom bias in its attention, so I also added a custom bias argument to the CUDA kernel.
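For reference, here is a plain PyTorch sketch (not the kernel itself) of where that bias enters: T5's relative-position bias is added to the raw attention scores before the softmax, and T5 also skips the usual `1/sqrt(d)` scaling.

```python
import torch

def attention_with_bias(q, k, v, bias):
    # q, k, v: [num_heads, seq_len, head_dim]
    # bias:    [num_heads, seq_len, seq_len] (T5's relative position bias)
    scores = torch.matmul(q, k.transpose(-1, -2))  # T5 applies no 1/sqrt(d) scaling
    scores = scores + bias                         # the custom bias argument
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)
```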
I can run the t5-small model successfully, but the outputs of t5-large and larger models become NaN at some point; I am still digging into this issue. Also, t5-small is only 20% faster than transformers on my machine, so there is likely a lot of room for performance improvement.
FIX #8036
Thanks @js8544, will review
Thanks @js8544 ! Taking a look
The NaN issue may arise from numerical overflow when using FP16 for computation in these models (T5-Large).
@js8544 please review this PR against your feature branch
https://github.com/js8544/vllm/pull/1
it adds a t5 encoder/decoder example file, and also finishes merging upstream main into your PR.
@afeldman-nm left a few specific comments. My overarching reaction is as follows:
The implementation avoids modifying any of the `input_metadata` or `kv_cache` allocation logic by putting the cross-attention KVs and the self-attention KVs in the same block table. This makes the `attention` and `prepare_decode` functions much more complicated, because we have to keep the block tables around in the model and constantly pad things.
So my question is:
- Should we continue down this path, or should we adjust and keep two separate block tables for the cross- and self-attention? Then we would make KV cache allocations to each of these explicitly and pass the two different block tables around.
I don't mind having two separate block tables. I chose to have them in one because it makes minimal changes to existing components; in fact, it only adds a couple of ifs in `ModelRunner`, and all other changes are limited to the enc-dec implementation. IIUC, having them separate would also require changes to `Scheduler`, `Sequence`, and `InputMetadata`.
Also cc @zhuohan123 for ideas and comments.
Note: I was wrong about this breaking decoders; that was a silly comment.
@js8544 It's a good point. Perhaps we can instead make the indexing into the block tables more transparent about what is going on, to make the code easier to follow. We will think more about it.
I agree. The current indexing scheme (by computing paddings each time) is painful and hard to read.
> The NaN issue may arise from numerical overflow when using FP16 for computation in these models (T5-Large).
Yeah, I noticed that the original transformers implementation of T5 also suffers from this. Using BF16 or FP32 should work.
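To illustrate the failure mode: FP16's maximum representable value is about 65504, so large intermediate activations overflow to inf, and operations like inf - inf then produce NaN. BF16 shares FP32's exponent range and avoids the overflow. A toy demonstration:

```python
import torch

x = torch.tensor([70000.0], dtype=torch.float16)
print(x)      # tensor([inf], dtype=torch.float16) -- exceeds FP16's ~65504 max
print(x - x)  # tensor([nan], dtype=torch.float16) -- inf - inf is NaN

# BF16 trades mantissa precision for FP32's exponent range, so no overflow here.
print(torch.tensor([70000.0], dtype=torch.bfloat16))  # tensor([70144.], dtype=torch.bfloat16)
```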
@zhuohan123 @WoosukKwon Would you guys mind taking a look at this PR? T5 seems to be working now.
FYI I think this PR has some conflicts with recent changes to the main branch. I am looking at resolving them.
This PR was previously passing all of the tests so I am hoping once the recent conflicts are resolved then we will be ready to merge @zhuohan123 @WoosukKwon
Hi,
Can this feature be extended to the T5-3b/11b and flanT5-xl/xxl models as well? We observe some errors in both of these cases.
Hello, any update on resolving the conflicts and merging?
Hello @Abineshik yes things are moving apace. Thanks for checking in.
I determined it is probably best for encoder-decoder models to have separate block tables for the self- and cross-attention KVs, as opposed to packing all KVs into a single block table, which had been the existing approach. Having two block tables makes it easier to work with the vLLM `Attention` wrapper when implementing encoder self-attention and decoder cross-attention, and the indexing arithmetic becomes cleaner.
I am almost finished making this change, then I will update the PR.
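To make the design concrete, here is a hypothetical sketch of the per-sequence bookkeeping (the names are illustrative, not the actual code in this PR): the self-attention block table grows as tokens are generated, while the cross-attention block table is fixed once the encoder KVs are cached.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EncDecBlockTables:
    # Blocks holding decoder self-attention KVs; grows each decode step.
    self_attn_blocks: List[int] = field(default_factory=list)
    # Blocks holding encoder-decoder cross-attention KVs;
    # allocated once after the encoder pass and then read-only.
    cross_attn_blocks: List[int] = field(default_factory=list)

    def append_self_attn_block(self, block_id: int) -> None:
        self.self_attn_blocks.append(block_id)
```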
@afeldman-nm how is the change you are working on going?
Work is still ongoing but hope to finish soon!