Cade Daniel
SG, I will take a look by Monday
I retried the AMD ones for you. The best way is to push an empty commit to restart the CI. If it keeps happening with AMD, let's see if we should auto...
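For reference, the empty-commit trick is just this (assuming a plain git checkout whose remote is wired to CI):

```
git commit --allow-empty -m "retrigger CI"
git push
```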
Yeah LGTM, let's get it merged
Thanks for the fix!
btw if you're interested in fixing this, see https://github.com/vllm-project/vllm/issues/4536
See the code linked here @youkaichao: https://github.com/vllm-project/vllm/issues/4632. The spec worker and non-spec workers share the same process.
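Rough sketch of that co-located layout, in case it helps — the spec-decode worker just holds the draft and target workers as attributes, so both live in one OS process with no RPC hop between them. Class and method names here are illustrative, not vLLM's actual API:

```python
# Illustrative sketch only, not vLLM's API: one process owns both workers.

class DraftWorker:
    def propose(self, prompt_ids):
        """Cheap model drafts a few candidate tokens."""
        return prompt_ids[-1:] * 3  # placeholder proposal

class TargetWorker:
    def score(self, prompt_ids, draft_ids):
        """Large model verifies the drafted tokens in one pass."""
        return draft_ids  # placeholder: accept everything

class SpecDecodeWorker:
    def __init__(self):
        # Both workers are constructed in the same process and can
        # share GPU state directly.
        self.draft = DraftWorker()
        self.target = TargetWorker()

    def step(self, prompt_ids):
        draft_ids = self.draft.propose(prompt_ids)
        return self.target.score(prompt_ids, draft_ids)
```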
> About tree attention/Medusa/Eagle: one of the core implementation pieces will be the tree attention mask in flash attention, which is currently not ready. I'd like to bring your attention to it...
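For anyone following along, here's a rough sketch of what a tree attention mask computes — each drafted token attends only to its ancestors in the candidate tree (plus itself), so several speculative branches can be verified in one forward pass. This is illustrative only, not the flash-attention kernel itself:

```python
# Sketch: build a tree attention mask for Medusa/Eagle-style drafting.
# parents[i] is the index of token i's parent, or -1 for a root token
# attached directly to the prompt.

import torch

def tree_attention_mask(parents: list[int]) -> torch.Tensor:
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask

# Two branches sharing a common first token: 0 -> 1 -> 2  and  0 -> 3
print(tree_attention_mask([-1, 0, 1, 0]).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 0, 0, 1]], dtype=torch.int32)
```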
@sighingnow this issue is for getting the 50% speedup. Once the P0s are done, we will get it with temperature 1.0.
> May I know more about the accept rate when we get the 50% speedup? Thanks!

On Llama 2 7B / Llama 2 70B, the acceptance rate was around 80% (no fine...
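To make the arithmetic concrete, here's a back-of-the-envelope sketch using the standard expected-accepted-length formula for speculative decoding (Leviathan et al., 2023). The draft length `k` and the draft/target cost ratio `c` below are assumptions, and this idealized model ignores scheduling and batching overhead, so real end-to-end speedups are lower:

```python
# Idealized speculative-decoding speedup from a per-token acceptance rate.

def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Expected tokens emitted per target forward pass when each of the
    # k drafted tokens is accepted independently with probability alpha.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, c: float) -> float:
    # One step costs k draft passes (each c times a target pass) plus
    # one target verification pass.
    return expected_tokens_per_step(alpha, k) / (k * c + 1)

# 80% acceptance, 5 drafted tokens, draft model ~10x cheaper than target:
print(f"{speedup(0.8, 5, 0.1):.2f}x")  # ~2.46x in this idealized model
```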
I think this breaks master: https://github.com/ray-project/ray/issues/32389