Lily Liu
I'll move forward with the length check. The original code should already handle the `len(prompt) + len(generated) > limit` case, so I only deal with `model limit` and `len(prompt) >...
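Roughly what the check looks like, as a minimal sketch; the names (`prompt_len`, `generated_len`, `request_limit`, `model_limit`) are illustrative, not the exact identifiers in the PR:

```python
def within_limits(prompt_len: int, generated_len: int,
                  request_limit: int, model_limit: int) -> bool:
    """Illustrative sketch only; names and structure may differ from the PR."""
    # Already handled by the original code: prompt + generated exceeds
    # the per-request limit.
    if prompt_len + generated_len > request_limit:
        return False
    # Cases handled in this change: the model's context limit, and a
    # prompt that is too long on its own.
    if prompt_len + generated_len > model_limit or prompt_len > model_limit:
        return False
    return True
```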
> It's indeed a good idea to make the speculative system smarter, to be able to automatically adjust according to the serving load and serving data. Along the same direction,...
> @LiuXiaoxuanPKU Thanks a lot for the super helpful info! I am very interested in the Dynamic Speculative Decoding mentioned above, and also found that the existing vllm framework cannot...
Passed the correctness tests for the single-GPU and TP settings. Feel free to take a first pass, @rkooo567.
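For reference, a rough sketch of this kind of check (not the actual test code from the PR): compare greedy outputs from a single-GPU run against a tensor-parallel run through vLLM's public API. The model name is just an example.

```python
# Rough illustration only, not the test used in the PR. In practice the two
# runs would happen in separate processes to avoid holding both engines in
# GPU memory at once.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
params = SamplingParams(temperature=0.0, max_tokens=32)  # greedy decoding

def generate_texts(tp_size: int) -> list[str]:
    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=tp_size)
    outputs = llm.generate(prompts, params)
    return [o.outputs[0].text for o in outputs]

single_gpu = generate_texts(tp_size=1)
tensor_parallel = generate_texts(tp_size=2)
assert single_gpu == tensor_parallel, "single-GPU and TP outputs diverge"
```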
> Very clean!! many comments are nits. So it seems like
>
> 1. not working with prefill
> 2. not working with prefix caching
> 3. not working with...
> @LiuXiaoxuanPKU Thank you for this PR! Can we support other prefill backends instead of just flash attention? like XFormers.

Yeah, will do! Plan to do that in a...
> there's an odd ci failure

Seems to be some package/import error; will take a look tonight.
> > there's an odd ci failure
>
> Seems to be some package/import error; will take a look tonight.

Caused by FlashInfer's support for PyTorch 2.3; will update this PR once...
> @LiuXiaoxuanPKU what's the latest supported version from flash infer? Also, is this something we can just simply build it with torch 2.3?

The latest supported version of the Python package...
Hi @zhyncs, thanks for the interest and the benchmarking. Several things here:

1. FlashInfer is not turned on by default; it can only be enabled with the environment variable `VLLM_ATTENTION_BACKEND=FLASHINFER`. We don't turn...
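For anyone trying to reproduce the numbers, a minimal sketch of enabling the FlashInfer backend (the model name is only an example, and the `flashinfer` package must be installed):

```python
import os

# Must be set before the engine is constructed, otherwise the default
# attention backend is selected.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```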