vllm Add Encoder-decoder model support and T5 Model support

Add support for encoder-decoder models and T5 as an example. There are mainly two differences between enc-dec and decoder-only models.

The KVCache blocks of prompt and generated tokens are seperated.
The generated tokens always start with <decoder_start_token_id>, and this token is added during the prompt phase.

T5 has a custom bias in its attention, so I also added a custom pias argument to the cuda kernel.

I can run the t5-small model successfully, but the outputs of t5-large and larger models become NaN at some point. I am still digging into this issue. Also, t5-small is only 20% faster than transformers on my machine. So it's likely that there a lot of room for performance improvement.

FIX #8036

Feb 29 '24 15:02 js8544

Thanks @js8544, will review

Feb 29 '24 16:02 robertgshaw2-redhat

Thanks @js8544 ! Taking a look

Feb 29 '24 16:02 afeldman-nm

The issue of NaN may arise due to numerical overflow when using FP16 for computation in these models(T5-Large).

Mar 01 '24 09:03 seanxcwang

@js8544 please review this PR against your feature branch

https://github.com/js8544/vllm/pull/1

it adds a t5 encoder/decoder example file, and also finishes merging upstream main into your PR.

Mar 01 '24 22:03 afeldman-nm

@afeldman-nm left a few specific comments. My overarching reaction is as follows:

The implementation does not modify any of the input_metadata or kv_cache allocation logic by putting the cross attention kvs and self-attention kvs in the same blocktable. This makes the attention and prepare_decode functions much more complicated because we have to keep around the block tables in the model + constantly pad things

So my question is:

Should we continue down this path or should we adjust and keep 2 separate block tables for the cross and self attention separately? Then make KV cache allocations to each of these explicitly and pass around the 2 different block tables explicitly

Mar 05 '24 01:03 robertgshaw2-redhat

I don't mind having two separate block tables. I chose to have them in one because it would make minimal change to existing components. In fact it only adds a couple of ifs in ModelRunner and all other changes are limited to the enc-dec implementation. IIUC having them separate would also require changes to Scheduler, Sequence, InputMetadata.

Also cc @zhuohan123 for ideas and comments.

Mar 05 '24 05:03 js8544

Note: I was wrong about this breaking decoders that was a silly comment

@js8544 Its a good point. Perhaps we can make the indexing into the block tables more transparent about what is going on to make the code easier to follow instead. We will think more about it

Mar 05 '24 14:03 robertgshaw2-redhat

I agree. The current indexing scheme (by computing paddings each time) is painful and hard to read.

Mar 05 '24 14:03 js8544

The issue of NaN may arise due to numerical overflow when using FP16 for computation in these models(T5-Large).

Yeah I noticed that transformers's original implementation of T5 also suffers from this. Using BF16 or FP32 should work.

Mar 12 '24 15:03 js8544

@zhuohan123 @WoosukKwon Would you guys mind taking a look at this PR? T5 seems to be working now.

Mar 12 '24 15:03 js8544

FYI I think this PR has some conflicts with recent changes to the main branch. I am looking at resolving them.

This PR was previously passing all of the tests so I am hoping once the recent conflicts are resolved then we will be ready to merge @zhuohan123 @WoosukKwon

Mar 12 '24 17:03 afeldman-nm

Hi,

Can this feature be extended to T5-3b/11b and flanT5-xl/xxl models as well. We observe some errors with both these cases.

Mar 23 '24 22:03 VarnithChordia

FYI I think this PR has some conflicts with recent changes to the main branch. I am looking at resolving them.

This PR was previously passing all of the tests so I am hoping once the recent conflicts are resolved then we will be ready to merge @zhuohan123 @WoosukKwon

Hello, Any update on fixing the conflict and merge?

Mar 28 '24 11:03 Abineshik

Hello @Abineshik yes things are moving apace. Thanks for checking in.

I determined it is probably for the best for encoder decoder models to have separate blocktables for self- and cross-attention KVs. As opposed to packing all KVs into a single blocktable, which had been the existing approach. Having two blocktables makes it easier to work with the vLLM Attention wrapper when implementing Encoder self-attention and Decoder cross-attention. The indexing arithmetic becomes cleaner.

I am almost finished making this change, then I will update the PR.

Mar 28 '24 15:03 afeldman-nm

@afeldman-nm how is the change you are working on going?

May 13 '24 01:05 rohithkrn

@afeldman-nm how is the change you are working on going?

Work is still ongoing but hope to finish soon!

May 13 '24 03:05 afeldman-nm

vllm vllm copied to clipboard

Add Encoder-decoder model support and T5 Model support

vllm
vllm copied to clipboard