vllm
[Bugfix] Handle `best_of>1` case by disabling speculation.
This PR solves #6137 by disabling speculation for batches that contain any request with best_of>1.
This approach ensures that requests with best_of>1 are handled without failure, but it has a downside: a single user sending requests with best_of>1 can degrade performance for other users whose best_of=1 requests land in the same batch.
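To make the batch-level check concrete, here is a minimal sketch of the idea. It is not the actual diff: `Request` and `should_disable_speculation` are hypothetical names used only for illustration and are not part of vLLM's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    # Hypothetical stand-in for a request's sampling settings.
    request_id: str
    best_of: int = 1


def should_disable_speculation(batch: List[Request]) -> bool:
    """Return True if any request in the batch uses best_of > 1.

    One such request is enough to turn speculation off for the
    whole batch, which is where the cross-user slowdown comes from.
    """
    return any(req.best_of > 1 for req in batch)


# Example: a single best_of=2 request affects the whole batch.
batch = [Request("a"), Request("b", best_of=2), Request("c")]
fallback_to_normal_decoding = should_disable_speculation(batch)
```

Under this sketch, requests "a" and "c" lose speculation for that step even though they never asked for best_of>1.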
An alternative would be to raise an error on those individual requests, with a message to the user such as "best_of > 1 is not supported when speculative decoding is enabled."
I can also implement that if preferred. What do you think, @cadedaniel @njhill?
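For comparison, rejecting such requests up front could look roughly like the following. This is a sketch only: `validate_request` and its `speculative_decoding_enabled` flag are hypothetical and do not correspond to an existing vLLM function.

```python
def validate_request(best_of: int, speculative_decoding_enabled: bool) -> None:
    """Reject a single request instead of disabling speculation for a batch.

    Hypothetical helper: raises ValueError for the unsupported combination
    and returns None otherwise.
    """
    if speculative_decoding_enabled and best_of > 1:
        raise ValueError(
            "best_of > 1 is not supported when speculative decoding is enabled."
        )


# Example: the offending request fails on its own; other requests are unaffected.
try:
    validate_request(best_of=2, speculative_decoding_enabled=True)
    rejected = False
except ValueError as exc:
    rejected = True
    message = str(exc)
```

The trade-off is the reverse of the batch-level approach: other users keep the speculative speedup, but the best_of>1 request gets an error instead of a slower result.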
Thanks for the fix -- approach looks good to me.
On whether or not we should support this -- for performance, we would want to either disable this feature or support it natively in spec decode. I am fine having this in; can we log once if this happens so there's a hint of the performance degradation to users?
> I am fine having this in; can we log once if this happens so there's a hint of the performance degradation to users?
I added a warning when we disable speculation due to n>1 or best_of>1.
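A log-once warning of this kind can be sketched as below. This is an illustration under assumptions rather than the code in this PR: `SpeculationGate`, its method names, and the warning text are all hypothetical.

```python
import logging

logger = logging.getLogger("spec_decode_sketch")


class SpeculationGate:
    """Hypothetical helper that decides whether to speculate and warns once."""

    def __init__(self) -> None:
        # Number of warnings actually emitted; stays at most 1.
        self.warnings_emitted = 0

    def allow_speculation(self, n: int, best_of: int) -> bool:
        if n > 1 or best_of > 1:
            if self.warnings_emitted == 0:
                logger.warning(
                    "Speculation disabled for this batch because a request "
                    "has n>1 or best_of>1; expect reduced performance."
                )
                self.warnings_emitted += 1
            return False
        return True


gate = SpeculationGate()
results = [
    gate.allow_speculation(n=1, best_of=1),  # speculation stays on
    gate.allow_speculation(n=1, best_of=2),  # disabled, warning logged
    gate.allow_speculation(n=2, best_of=2),  # disabled, no second warning
]
```

Tracking the state on the instance keeps the warning from flooding the logs when many such batches arrive.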
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!