vllm [Bugfix] Handle `best_of>1` case by disabling speculation.

Open tdoublep opened this issue 1 year ago • 1 comments

This PR solves #6137 by disabling speculation for batches that contain any request with best_of>1.

This approach ensures that requests with `best_of>1` are handled without failure. The downside is that a single user sending requests with `best_of>1` can degrade performance for other users sending `best_of=1`, because speculation is disabled for the whole batch.
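To make the batch-level behavior concrete, here is a minimal sketch of the check described above. The `Request` class and `should_disable_speculation` function are hypothetical stand-ins for illustration, not vLLM's actual types or the code in this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    # Hypothetical stand-in for a scheduled request; not vLLM's real class.
    request_id: str
    best_of: int = 1


def should_disable_speculation(batch: List[Request]) -> bool:
    """Return True if any request in the batch asks for best_of > 1."""
    return any(req.best_of > 1 for req in batch)
```

Under this sketch, a batch of `best_of=1` requests keeps speculation on, while adding one `best_of=2` request turns it off for every request in that batch, which is the downside noted above.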

An alternative would be to raise an error for those individual requests, with a message such as "`best_of > 1` is not supported when speculative decoding is enabled".
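A rough sketch of what that per-request rejection might look like follows. The function name and the `speculative_decoding_enabled` parameter are assumptions made for illustration; they are not part of vLLM's API.

```python
def validate_best_of(best_of: int, speculative_decoding_enabled: bool) -> None:
    """Reject a single request instead of disabling speculation for the batch.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    if speculative_decoding_enabled and best_of > 1:
        raise ValueError(
            "best_of > 1 is not supported when speculative decoding is enabled."
        )
```

With this variant only the offending request fails, so other users' requests keep running with speculation enabled.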

I can also implement that if preferred. What do you think, @cadedaniel @njhill?

Jul 04 '24 10:07 tdoublep

Thanks for the fix -- approach looks good to me.

On whether or not we should support this -- for performance we would want to disable this feature or support it natively in spec decode. I am fine having this in. Can we log once if this happens so there's a hint of the performance degradation to users?

Jul 09 '24 06:07 cadedaniel

> I am fine having this in. Can we log once if this happens so there's a hint of the performance degradation to users?

I added a warning when we disable speculation due to n>1 or best_of>1.
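For readers unfamiliar with the "log once" pattern requested above, a minimal sketch follows. It is not the code from this PR; the logger name, flag, and function are hypothetical.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("spec_decode_sketch")

# Module-level flag so the warning is emitted at most once per process.
_warned_speculation_disabled = False


def warn_speculation_disabled_once() -> bool:
    """Log a warning the first time speculation is disabled.

    Returns True if the warning was emitted on this call, False otherwise.
    """
    global _warned_speculation_disabled
    if _warned_speculation_disabled:
        return False
    _warned_speculation_disabled = True
    logger.warning(
        "Speculation disabled for a batch containing a request with "
        "n>1 or best_of>1; expect reduced performance for that batch."
    )
    return True
```

Logging only once avoids flooding the logs when many such batches arrive, while still giving operators a hint about the cause of a slowdown.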

Jul 15 '24 11:07 tdoublep

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

Oct 25 '24 02:10 github-actions[bot]

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

Nov 24 '24 02:11 github-actions[bot]
