Cade Daniel
I am afk for a few weeks unfortunately. cc @LiuXiaoxuanPKU @sroy745 @njhill, vLLM spec decode experts
These are great ideas! Contributions welcome :)
One other idea you should consider is using a multi-LoRA draft model.
Sounds good. Btw I don't think we should let users decide the spec method, as it gives them too much flexibility to impact other users -- it should be set by the service...
It's challenging to do this because even if you get an async draft model, it won't have the latest accepted tokens from the target model during drafting. That probably means we need a...
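For context, here's a minimal sketch (plain Python, not vLLM's actual scheduler code) of the data dependency that makes fully-async drafting awkward: each draft step has to be conditioned on whatever the target model just accepted, so the draft model can't simply run ahead without a sync point.

```python
# Toy, self-contained illustration (not vLLM code) of the dependency described above.

def speculative_loop(draft_model, target_model, prompt_ids, k=4, max_new_tokens=64):
    """Synchronous speculative decoding skeleton.

    draft_model(ids, k)           -> list of k proposed token ids
    target_model(ids, proposals)  -> list of accepted token ids (incl. bonus token)
    Both callables are placeholders; the point is the data dependency below.
    """
    ids = list(prompt_ids)
    target_len = len(prompt_ids) + max_new_tokens
    while len(ids) < target_len:
        # The draft step must see the latest accepted tokens in `ids`.
        # If drafting were kicked off asynchronously, before the previous
        # verification finished, these proposals would be conditioned on a
        # stale prefix and would mostly be rejected.
        proposals = draft_model(ids, k)

        # The target model verifies all proposals in one batched forward pass.
        accepted = target_model(ids, proposals)
        ids.extend(accepted)
    return ids
```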
Hello @jiqing-feng. I can review the PR if you can show a performance benefit in a configuration that users care about. Can you demonstrate a performance improvement and...
I feel the improvements are too marginal to justify supporting this proposer, unfortunately.
Is there a faster CPU model you can try? It may also be worth trying a 70B base model, since the cost ratio is larger.
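To make the cost-ratio point concrete, here's a back-of-the-envelope sketch using the standard speculative decoding cost model from Leviathan et al. (2023); the acceptance rate and cost numbers below are illustrative, not measured.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected walltime improvement factor for speculative decoding.

    alpha: per-token acceptance rate of the draft proposals
    gamma: number of draft tokens proposed per step
    c:     cost of one draft forward pass relative to one target forward pass
    """
    expected_accepted = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return expected_accepted / (gamma * c + 1)

# A CPU draft next to a 70B target has a much smaller relative cost c than the
# same draft next to a 7B target, so the same acceptance rate buys more speedup.
print(expected_speedup(alpha=0.7, gamma=4, c=0.05))  # ~2.3x with a cheap draft
print(expected_speedup(alpha=0.7, gamma=4, c=0.5))   # <1x when the draft is expensive
```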
What QPS is this benchmark running at? Note that my graph shows various QPS values and how the speedup falls off at higher QPS.
also cc @rkooo567