[Bugfix] [Core] Fix zero temperature case (#5404 and part of #5898)
Fixes nondeterminism from the sampler that occurs even at zero temperature. The patch supports both zero and nonzero temperatures for different entries in the same batch, applying the correct operation in each case: it multiplies by an is-zero indicator for the zero-temperature calculation and by an is-nonzero indicator for the positive-temperature case, then sums the two terms. For any entry, only one term in the sum is nonzero, since the two cases are mutually exclusive.
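A minimal PyTorch sketch of this masking (the function name, the `_SAMPLING_EPS` threshold, and the exact tensor shapes here are my assumptions for illustration; see the actual diff for the precise code):

```python
import torch

_SAMPLING_EPS = 1e-5  # assumed threshold below which a temperature is treated as zero


def apply_temperature(logits: torch.Tensor, temperatures: torch.Tensor) -> torch.Tensor:
    """Scale logits per request, handling zero and nonzero temperatures in one batch.

    logits: [num_seqs, vocab_size]; temperatures: [num_seqs].
    """
    temps = temperatures.unsqueeze(dim=1).to(logits.dtype)
    is_zero = (temps < _SAMPLING_EPS).to(logits.dtype)
    is_nonzero = 1.0 - is_zero

    # Zero-temperature case: boost the maximum logit so downstream sampling picks
    # it deterministically (this is the 1e9 * (logits == max) term discussed below).
    max_mask = (logits == logits.max(dim=1, keepdim=True)[0]).to(logits.dtype)
    logits_z = is_zero * logits * 1e9 * max_mask

    # Nonzero-temperature case: the usual division by temperature. Clamping avoids
    # 0 * inf = NaN on rows that the is_nonzero mask zeroes out anyway.
    logits_p = is_nonzero * logits / temps.clamp(min=_SAMPLING_EPS)

    # The two cases are mutually exclusive, so exactly one term is nonzero per row.
    return logits_z + logits_p
```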
I created this patch and discussed it in https://github.com/vllm-project/vllm/issues/5404#issuecomment-2628860618. I have used this patch internally and it greatly reduced variation in answers at zero temperature (the remaining nondeterminism, I think, is due to CUDA nondeterminism in the steps leading up to the calculation of the logits).
Please feel free to make any improvements to this patch you can think of.
FIX #5404 (patch discussed there).
FIX #5898 (the sampler part; nondeterminism from the computation of the logits themselves is not covered here).
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
For the pre-commit hook error here, it reports missing SPDX headers in tests/lora/test_ultravox.py, a file I didn't modify at all in my commits. The pre-commit hooks passed on my computer, so I think the failure comes from unrelated commits made to vLLM since then.
For the test with increased GPU VRAM usage, some increase makes sense due to the creation of additional tensors. The previous code just did an in-place divide after the conversion to float32, while the current code needs to compute tensors for both cases and sum them. I think correctness of outputs at zero temperature (a very common use case) matters more than saving GPU VRAM at the sampling step. It may also be possible to do more of the work in-place, but some new tensors are inevitable with this approach.
However, it's possible to avoid the increased GPU VRAM usage entirely with a custom CUDA kernel: rewrite this logic as a kernel and call that. No new tensors would be needed, since the computations would be done in temporary variables within a loop, with the final values written directly to the output tensor. I think we should merge this first and then implement the CUDA kernel as an optimization; I have no experience writing CUDA kernels (though I did write and optimize OpenCL code many years ago).
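As a rough Python sketch of the in-place direction in the meantime (names and details are my assumptions, not the PR code; the row-indexed writes still allocate temporaries, just fewer full-size ones):

```python
import torch

_SAMPLING_EPS = 1e-5  # assumed zero-temperature threshold


def apply_temperature_inplace(logits: torch.Tensor, temperatures: torch.Tensor) -> torch.Tensor:
    """Lower-allocation variant: modify logits in place instead of building two
    full-size masked tensors and summing them."""
    temps = temperatures.unsqueeze(dim=1).to(logits.dtype)
    zero_rows = temperatures < _SAMPLING_EPS

    # Nonzero-temperature rows: in-place divide, as the pre-patch code did.
    # Zero-temperature rows get divided by the clamped value too, but they are
    # overwritten immediately afterwards, so that result is never used.
    logits /= temps.clamp(min=_SAMPLING_EPS)

    # Zero-temperature rows: mark the argmax positions with a large value so
    # downstream sampling is deterministic for those rows.
    if zero_rows.any():
        sub = logits[zero_rows]
        max_vals = sub.max(dim=1, keepdim=True)[0]
        logits[zero_rows] = torch.where(sub == max_vals,
                                        torch.full_like(sub, 1e9),
                                        torch.full_like(sub, -1e9))
    return logits
```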
Wow this is amazing! Thanks for the thorough investigation!
Sorry for the delay in review. LGTM overall, but I'd like to understand the problem in more detail. Let me get back to this within 1~2 days.
@StanHatko thank you for the investigation.
To me this looks like a workaround for an underlying problem of doing argmax on the logprobs for greedy sampling.
Could you try out this PR which fixes that? https://github.com/vllm-project/vllm/pull/13312
At a minimum I think this should avoid the need for the * 1e9 * (logits == logits.max(dim=1, keepdim=True)[0]) part, i.e. instead just keep
logits_z = is_zero * logits
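Concretely, reusing the names from the sketch in the PR description (and assuming #13312 makes the greedy path take the argmax deterministically), something like:

```python
# Simplified combination: zero-temperature rows keep their raw logits and rely on
# deterministic greedy selection downstream; nonzero rows are rescaled as usual.
logits_z = is_zero * logits
logits_p = is_nonzero * logits / temps.clamp(min=_SAMPLING_EPS)
logits = logits_z + logits_p
```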
Man I forgot how convoluted our v0 sampler code was :sweat_smile:
Yes, fixing the sampler to be deterministic in the temperature = 0 case seems like a better solution; by just taking the largest logit in that case, everything here can be avoided. I'll check whether that pull request fixes the issue.
Unfortunately I'm having trouble reproducing some of the old non-deterministic runs from before. There have been a bunch of changes since, and the prompts here were long, complex, and contained data that can't be released publicly. With simple prompts I didn't see the non-determinism in some basic tests. If anyone has a minimal working example, please post it.
Some performance optimizations for throughput that I enabled (in particular prefix caching, sorting prompts before feeding them in to take advantage of prefix caching, chunked prefill, and CUDA graph optimization) together seemed to reduce the amount of non-determinism: the number of unique responses for 5 different queries, each repeated 100 times, went from 3, 4, 1, 2, 3 down to 1, 2, 1, 1, 2.
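For reference, roughly how I enabled those options (model name and prompts are placeholders; flag names follow vLLM's engine arguments, so check them against your version):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="my-model",               # placeholder
    enable_prefix_caching=True,     # reuse KV cache across shared prompt prefixes
    enable_chunked_prefill=True,    # split long prefills into chunks
    enforce_eager=False,            # keep CUDA graph capture enabled
)

# Sorting prompts groups shared prefixes together so prefix caching hits more often.
prompts = sorted(["..."])           # placeholder prompt list
outputs = llm.generate(prompts, SamplingParams(temperature=0.0))
```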
The new PR looks good. I agree the sampler itself is the better place to make the changes; the reason I didn't do that myself is that I wasn't familiar with the sampler code but understood the logits and logprobs code. For greedy sampling, is it guaranteed the flag will be set for zero-temperature entries? If there's a batch with a mixture of zero and nonzero temperatures, will it use greedy sampling for the zero-temperature entries and regular sampling for the rest?
Thanks @StanHatko
If there's a batch with a mixture of zero and nonzero temperatures, will it use greedy sampling for the zero-temperature entries and regular sampling for the rest?
@StanHatko yes that's correct, and greedy requests are determined by temperature < _SAMPLING_EPS.
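For illustration, the check amounts to something like this (the constant lives in vllm/sampling_params.py; the 1e-5 value and the helper name here are assumptions):

```python
# Requests with (near-)zero temperature take the deterministic greedy/argmax path;
# everything else goes through regular temperature-scaled sampling.
_SAMPLING_EPS = 1e-5  # assumed value of the threshold


def is_greedy(temperature: float) -> bool:
    return temperature < _SAMPLING_EPS
```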
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @StanHatko.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!