[Bugfix] [Core] Fix zero temperature case (#5404 and part of #5898)
Fixes nondeterminism from the sampler that occurs even at zero temperature. The patch supports both zero and nonzero temperatures for different entries in the same batch, applying the correct operation in each case: it multiplies by an is-zero indicator for the zero-temperature calculation and by an is-nonzero indicator for the positive-temperature case, then sums the two terms. For any entry, only one term in the sum is nonzero, since the two cases are mutually exclusive.
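A minimal PyTorch sketch of this masking (the function name, the `_SAMPLING_EPS` threshold, and the exact tensor shapes here are my assumptions for illustration; see the actual diff for the precise code):

```python
import torch

_SAMPLING_EPS = 1e-5  # assumed threshold below which a temperature is treated as zero


def apply_temperature(logits: torch.Tensor, temperatures: torch.Tensor) -> torch.Tensor:
    """Scale logits per request, handling zero and nonzero temperatures in one batch.

    logits: [num_seqs, vocab_size]; temperatures: [num_seqs].
    """
    temps = temperatures.unsqueeze(dim=1).to(logits.dtype)
    is_zero = (temps < _SAMPLING_EPS).to(logits.dtype)
    is_nonzero = 1.0 - is_zero

    # Zero-temperature case: boost the maximum logit so downstream sampling picks
    # it deterministically (this is the 1e9 * (logits == max) term discussed below).
    max_mask = (logits == logits.max(dim=1, keepdim=True)[0]).to(logits.dtype)
    logits_z = is_zero * logits * 1e9 * max_mask

    # Nonzero-temperature case: the usual division by temperature. Clamping avoids
    # 0 * inf = NaN on rows that the is_nonzero mask zeroes out anyway.
    logits_p = is_nonzero * logits / temps.clamp(min=_SAMPLING_EPS)

    # The two cases are mutually exclusive, so exactly one term is nonzero per row.
    return logits_z + logits_p
```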
I created this patch and discussed it in https://github.com/vllm-project/vllm/issues/5404#issuecomment-2628860618. I have used this patch internally and it greatly reduced variation in answers at zero temperature (the remaining nondeterminism, I think, is due to CUDA nondeterminism in the steps leading up to the calculation of the logits).
Please feel free to make any improvements to this patch you can think of.
FIX #5404 (patch discussed there).
FIX #5898 (the sampler part; nondeterminism from the computation of the logits themselves is not covered here).
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
For the pre-commit hook error here, it reports missing SPDX headers in tests/lora/test_ultravox.py, a file I didn't modify at all in my commits. The pre-commit hooks passed on my computer, so I think the failure comes from unrelated commits made to vLLM since then.
For the test with increased GPU VRAM usage, some increase makes sense due to the creation of additional tensors. The previous code just did an in-place divide after the conversion to float32, while the current code needs to compute tensors for both cases and sum them. I think correctness of outputs at zero temperature (a very common use case) matters more than saving GPU VRAM at the sampling step. It may also be possible to do more of the work in-place, but some new tensors are inevitable with this approach.
However, it's possible to avoid the increased GPU VRAM usage entirely with a custom CUDA kernel: rewrite this logic as a kernel and call that. No new tensors would be needed, since the computations would be done in temporary variables within a loop, with the final values written directly to the output tensor. I think we should merge this first and then implement the CUDA kernel as an optimization; I have no experience writing CUDA kernels (though I did write and optimize OpenCL code many years ago).
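As a rough Python sketch of the in-place direction in the meantime (names and details are my assumptions, not the PR code; the row-indexed writes still allocate temporaries, just fewer full-size ones):

```python
import torch

_SAMPLING_EPS = 1e-5  # assumed zero-temperature threshold


def apply_temperature_inplace(logits: torch.Tensor, temperatures: torch.Tensor) -> torch.Tensor:
    """Lower-allocation variant: modify logits in place instead of building two
    full-size masked tensors and summing them."""
    temps = temperatures.unsqueeze(dim=1).to(logits.dtype)
    zero_rows = temperatures < _SAMPLING_EPS

    # Nonzero-temperature rows: in-place divide, as the pre-patch code did.
    # Zero-temperature rows get divided by the clamped value too, but they are
    # overwritten immediately afterwards, so that result is never used.
    logits /= temps.clamp(min=_SAMPLING_EPS)

    # Zero-temperature rows: mark the argmax positions with a large value so
    # downstream sampling is deterministic for those rows.
    if zero_rows.any():
        sub = logits[zero_rows]
        max_vals = sub.max(dim=1, keepdim=True)[0]
        logits[zero_rows] = torch.where(sub == max_vals,
                                        torch.full_like(sub, 1e9),
                                        torch.full_like(sub, -1e9))
    return logits
```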
Wow this is amazing! Thanks for the thorough investigation!
Sorry for the delay in review. LGTM overall, but I'd like to understand the problem in more detail. Let me get back to this within 1~2 days.
@StanHatko thank you for the investigation.
To me this looks like a workaround for an underlying problem of doing argmax on the logprobs for greedy sampling.
Could you try out this PR which fixes that? https://github.com/vllm-project/vllm/pull/13312
At a minimum I think this should avoid the need for the * 1e9 * (logits == logits.max(dim=1, keepdim=True)[0]) part, i.e. instead just keep
logits_z = is_zero * logits
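Concretely, reusing the names from the sketch in the PR description (and assuming #13312 makes the greedy path take the argmax deterministically), something like:

```python
# Simplified combination: zero-temperature rows keep their raw logits and rely on
# deterministic greedy selection downstream; nonzero rows are rescaled as usual.
logits_z = is_zero * logits
logits_p = is_nonzero * logits / temps.clamp(min=_SAMPLING_EPS)
logits = logits_z + logits_p
```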
Man I forgot how convoluted our v0 sampler code was :sweat_smile:
Yes, fixing the sampler to be deterministic in the temperature = 0 case seems like a better solution; by just taking the largest logit in that case, everything here can be avoided. I'll check whether that pull request fixes the issue.
Unfortunately I'm having trouble reproducing some of the old non-deterministic runs from before. There have been a bunch of changes since, and the prompts here were long, complex, and contained data that can't be released publicly. With simple prompts I didn't see the non-determinism in some basic tests. If anyone has a minimal working example, please post it.
Some performance optimizations for throughput that I enabled (in particular prefix caching, sorting prompts before feeding them in to take advantage of prefix caching, chunked prefill, and CUDA graph optimization) together seemed to reduce the amount of non-determinism: the number of unique responses for 5 different queries, each repeated 100 times, went from 3, 4, 1, 2, 3 down to 1, 2, 1, 1, 2.
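For reference, roughly how I enabled those options (model name and prompts are placeholders; flag names follow vLLM's engine arguments, so check them against your version):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="my-model",               # placeholder
    enable_prefix_caching=True,     # reuse KV cache across shared prompt prefixes
    enable_chunked_prefill=True,    # split long prefills into chunks
    enforce_eager=False,            # keep CUDA graph capture enabled
)

# Sorting prompts groups shared prefixes together so prefix caching hits more often.
prompts = sorted(["..."])           # placeholder prompt list
outputs = llm.generate(prompts, SamplingParams(temperature=0.0))
```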
The new PR looks good. I agree the sampler itself is the better place to make the changes; the reason I didn't do that myself is that I wasn't familiar with the sampler code but understood the logits and logprobs code. For greedy sampling, is it guaranteed the flag will be set for zero-temperature entries? If there's a batch with a mixture of zero and nonzero temperatures, will it use greedy sampling for the zero-temperature entries and regular sampling for the rest?
Thanks @StanHatko
If there's a batch with a mixture of zero and nonzero temperatures, will it use greedy sampling for the zero-temperature entries and regular sampling for the rest?
@StanHatko yes that's correct, and greedy requests are determined by temperature < _SAMPLING_EPS.
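For illustration, the check amounts to something like this (the constant lives in vllm/sampling_params.py; the 1e-5 value and the helper name here are assumptions):

```python
# Requests with (near-)zero temperature take the deterministic greedy/argmax path;
# everything else goes through regular temperature-scaled sampling.
_SAMPLING_EPS = 1e-5  # assumed value of the threshold


def is_greedy(temperature: float) -> bool:
    return temperature < _SAMPLING_EPS
```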
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @StanHatko.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!