Lily Liu
I'll move forward with the length check. The original code should already handle the `len(prompt) + len(generated) > limit` case, so I only deal with `model limit` and `len(prompt) >...
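Roughly what the check looks like, as a minimal sketch; the names (`prompt_len`, `generated_len`, `request_limit`, `model_limit`) are illustrative, not the exact identifiers in the PR:

```python
def within_limits(prompt_len: int, generated_len: int,
                  request_limit: int, model_limit: int) -> bool:
    """Illustrative sketch only; names and structure may differ from the PR."""
    # Already handled by the original code: prompt + generated exceeds
    # the per-request limit.
    if prompt_len + generated_len > request_limit:
        return False
    # Cases handled in this change: the model's context limit, and a
    # prompt that is too long on its own.
    if prompt_len + generated_len > model_limit or prompt_len > model_limit:
        return False
    return True
```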
> It's indeed a good idea to make the speculative system smarter, to be able to automatically adjust according to the serving load and serving data. Along the same direction,...
> @LiuXiaoxuanPKU Thanks a lot for the super helpful info! I am very interested in the Dynamic Speculative Decoding mentioned above, and also found that the existing vllm framework cannot...
Passed the correctness tests for the single-GPU and TP settings. Feel free to take a first pass, @rkooo567.
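For reference, a rough sketch of this kind of check (not the actual test code from the PR): compare greedy outputs from a single-GPU run against a tensor-parallel run through vLLM's public API. The model name is just an example.

```python
# Rough illustration only, not the test used in the PR. In practice the two
# runs would happen in separate processes to avoid holding both engines in
# GPU memory at once.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
params = SamplingParams(temperature=0.0, max_tokens=32)  # greedy decoding

def generate_texts(tp_size: int) -> list[str]:
    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=tp_size)
    outputs = llm.generate(prompts, params)
    return [o.outputs[0].text for o in outputs]

single_gpu = generate_texts(tp_size=1)
tensor_parallel = generate_texts(tp_size=2)
assert single_gpu == tensor_parallel, "single-GPU and TP outputs diverge"
```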
> Very clean!! many comments are nits. So it seems like
>
> 1. not working with prefill
> 2. not working with prefix caching
> 3. not working with...
> @LiuXiaoxuanPKU Thank you for this PR! Can we support other prefill backends instead of just flash attention? like XFormers.

Yeah, will do! Plan to do that in a...
> there's an odd ci failure

Seems to be some package/import error; will take a look tonight.
> > there's an odd ci failure
>
> Seems to be some package/import error; will take a look tonight.

Caused by FlashInfer's support for PyTorch 2.3; will update this PR once...
> @LiuXiaoxuanPKU what's the latest supported version from flash infer? Also, is this something we can just simply build it with torch 2.3?

The latest supported version of the Python package...
Hi @zhyncs, thanks for the interest and the benchmarking. Several things here:

1. FlashInfer is not turned on by default; it can only be enabled with the environment variable `VLLM_ATTENTION_BACKEND=FLASHINFER`. We don't turn...
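For anyone trying to reproduce the numbers, a minimal sketch of enabling the FlashInfer backend (the model name is only an example, and the `flashinfer` package must be installed):

```python
import os

# Must be set before the engine is constructed, otherwise the default
# attention backend is selected.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```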