Cade Daniel
I am afk for a few weeks unfortunately. cc @LiuXiaoxuanPKU @sroy745 @njhill, vLLM spec decode experts
These are great ideas! Contributions welcome :)
One other idea you should consider is using a multi-LoRA draft model.
Sounds good. Btw I don't think we should let users decide the spec method, as it gives them too much flexibility to impact other users -- it should be set by the service...
It's challenging to do this because even if you get an async draft model, it won't have the latest accepted tokens from the target model during drafting. That probably means we need a...
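For context, here's a minimal sketch (plain Python, not vLLM's actual scheduler code) of the data dependency that makes fully-async drafting awkward: each draft step has to be conditioned on whatever the target model just accepted, so the draft model can't simply run ahead without a sync point.

```python
# Toy, self-contained illustration (not vLLM code) of the dependency described above.

def speculative_loop(draft_model, target_model, prompt_ids, k=4, max_new_tokens=64):
    """Synchronous speculative decoding skeleton.

    draft_model(ids, k)           -> list of k proposed token ids
    target_model(ids, proposals)  -> list of accepted token ids (incl. bonus token)
    Both callables are placeholders; the point is the data dependency below.
    """
    ids = list(prompt_ids)
    target_len = len(prompt_ids) + max_new_tokens
    while len(ids) < target_len:
        # The draft step must see the latest accepted tokens in `ids`.
        # If drafting were kicked off asynchronously, before the previous
        # verification finished, these proposals would be conditioned on a
        # stale prefix and would mostly be rejected.
        proposals = draft_model(ids, k)

        # The target model verifies all proposals in one batched forward pass.
        accepted = target_model(ids, proposals)
        ids.extend(accepted)
    return ids
```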
Hello @jiqing-feng. I can review the PR if you can show a performance benefit in a configuration that users care about. Can you demonstrate a performance improvement and...
I feel the improvements are too marginal to justify supporting this proposer, unfortunately.
Is there a faster CPU model you can try? It may also be worth trying a 70B base model, since the cost ratio is larger.
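To make the cost-ratio point concrete, here's a back-of-the-envelope sketch using the standard speculative decoding cost model from Leviathan et al. (2023); the acceptance rate and cost numbers below are illustrative, not measured.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected walltime improvement factor for speculative decoding.

    alpha: per-token acceptance rate of the draft proposals
    gamma: number of draft tokens proposed per step
    c:     cost of one draft forward pass relative to one target forward pass
    """
    expected_accepted = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return expected_accepted / (gamma * c + 1)

# A CPU draft next to a 70B target has a much smaller relative cost c than the
# same draft next to a 7B target, so the same acceptance rate buys more speedup.
print(expected_speedup(alpha=0.7, gamma=4, c=0.05))  # ~2.3x with a cheap draft
print(expected_speedup(alpha=0.7, gamma=4, c=0.5))   # <1x when the draft is expensive
```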
What QPS is this benchmark running at? Note that my graph shows various QPS values and how the speedup falls off at higher QPS.
also cc @rkooo567