Cade Daniel
SG, I will take a look by Monday
I retried the AMD ones for you. The best way is to push an empty commit to restart the CI. If it keeps happening with AMD, let's see if we should auto...
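For reference, the empty-commit trick is just this (assuming a plain git checkout whose remote is wired to CI):

```
git commit --allow-empty -m "retrigger CI"
git push
```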
Yeah LGTM, let's get it merged
Thanks for the fix!
btw if you're interested in fixing this, see https://github.com/vllm-project/vllm/issues/4536
See the code linked here @youkaichao: https://github.com/vllm-project/vllm/issues/4632. The spec worker and non-spec workers share the same process.
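Rough sketch of that co-located layout, in case it helps — the spec-decode worker just holds the draft and target workers as attributes, so both live in one OS process with no RPC hop between them. Class and method names here are illustrative, not vLLM's actual API:

```python
# Illustrative sketch only, not vLLM's API: one process owns both workers.

class DraftWorker:
    def propose(self, prompt_ids):
        """Cheap model drafts a few candidate tokens."""
        return prompt_ids[-1:] * 3  # placeholder proposal

class TargetWorker:
    def score(self, prompt_ids, draft_ids):
        """Large model verifies the drafted tokens in one pass."""
        return draft_ids  # placeholder: accept everything

class SpecDecodeWorker:
    def __init__(self):
        # Both workers are constructed in the same process and can
        # share GPU state directly.
        self.draft = DraftWorker()
        self.target = TargetWorker()

    def step(self, prompt_ids):
        draft_ids = self.draft.propose(prompt_ids)
        return self.target.score(prompt_ids, draft_ids)
```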
> About tree attention/Medusa/Eagle: one of the core implementation pieces will be the tree attention mask in flash attention, which is currently not ready. I'd like to bring your attention to it...
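For anyone following along, here's a rough sketch of what a tree attention mask computes — each drafted token attends only to its ancestors in the candidate tree (plus itself), so several speculative branches can be verified in one forward pass. This is illustrative only, not the flash-attention kernel itself:

```python
# Sketch: build a tree attention mask for Medusa/Eagle-style drafting.
# parents[i] is the index of token i's parent, or -1 for a root token
# attached directly to the prompt.

import torch

def tree_attention_mask(parents: list[int]) -> torch.Tensor:
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask

# Two branches sharing a common first token: 0 -> 1 -> 2  and  0 -> 3
print(tree_attention_mask([-1, 0, 1, 0]).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 0, 0, 1]], dtype=torch.int32)
```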
@sighingnow this issue is for getting the 50% speedup. Once the P0s are done, we will get it with temperature 1.0.
> May I know more about the accept rate when we get the 50% speedup? Thanks!

On Llama 2 7B / Llama 2 70B, the acceptance rate was around 80% (no fine...
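To make the arithmetic concrete, here's a back-of-the-envelope sketch using the standard expected-accepted-length formula for speculative decoding (Leviathan et al., 2023). The draft length `k` and the draft/target cost ratio `c` below are assumptions, and this idealized model ignores scheduling and batching overhead, so real end-to-end speedups are lower:

```python
# Idealized speculative-decoding speedup from a per-token acceptance rate.

def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Expected tokens emitted per target forward pass when each of the
    # k drafted tokens is accepted independently with probability alpha.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, c: float) -> float:
    # One step costs k draft passes (each c times a target pass) plus
    # one target verification pass.
    return expected_tokens_per_step(alpha, k) / (k * c + 1)

# 80% acceptance, 5 drafted tokens, draft model ~10x cheaper than target:
print(f"{speedup(0.8, 5, 0.1):.2f}x")  # ~2.46x in this idealized model
```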
I think this breaks master: https://github.com/ray-project/ray/issues/32389