[Bug]: topk=1 and temperature=0 cause different output in vllm

Open rangehow opened this issue 1 year ago • 20 comments

🐛 Describe the bug

When using different generation configurations, such as top_k=1 or temperature=0 (while keeping other settings unchanged), why do the generated results change? Both should correspond to deterministic greedy decoding. Version: vLLM 0.4.3.


Supplement:

The main issue encountered here is that the results generated by setting the temperature coefficient to 0 or topk to 1 are different. I understand that due to operator optimization and the lack of conventional arithmetic properties in floating-point numbers, matrix operations have a certain randomness. However, the sampling process occurs after the hidden_state is generated, at which point no calculations are involved. Therefore, the sampling results of the two sampling parameters should be the same.

rangehow avatar Jun 11 '24 03:06 rangehow

Hi,

I am also seeing different results for the same prompt even though temperature is set to 0. The complete sampling parameters are:

SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=0, top_k=1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=128, min_tokens=0, logprobs=1, prompt_logprobs=1, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)

Version was just updated to v0.4.3.

mreso avatar Jun 11 '24 21:06 mreso

I'm investigating the issue. I verified the bug by running examples/offline_inference.py with:

sampling_params = SamplingParams(temperature=0.0, max_tokens=10)
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

However, the bug appears only when prompts are added to or removed from the input batch. The same behavior is seen across older versions (v0.3.3, v0.4.2, v0.4.3).

Selected output for reference:

Prompt from first batch: 'Hello, my name is', Generated text: " and I'm writing you today to learn more about"

Prompt from first batch: 'The capital of France is', Generated text: ' Paris, which is located in the north of the'

VS

Prompt from second batch: 'Hello, my name is', Generated text: " and I'm writing you today to learn more about"

Prompt from second batch: 'The capital of France is', Generated text: ' Paris. It is located in the north of the'

Prompt from second batch: 'The future of AI is', Generated text: ' here, and it’s already changing the way we'

first_sampling_result = [
   [([323], [0]), ([12366], [0])],
   [([358], [0]), ([11], [0])],
   [([2846], [0]), ([902], [0])],
   [([4477], [0]), ([374], [0])],
   [([499], [0]), ([7559], [0])],
   [([3432], [0]), ([304], [0])],
   [([311], [0]), ([279], [0])],
   [([4048], [0]), ([10411], [0])],
   [([810], [0]), ([315], [0])], 
   [([922], [0]), ([279], [0])]
]

second_sampling_result = [
   [([323], [0]), ([12366], [0]), ([1618], [0])],
   [([358], [0]), ([13], [0]), ([11], [0])],
   [([2846], [0]), ([1102], [0]), ([323], [0])],
   [([4477], [0]), ([374], [0]), ([433], [0])],
   [([499], [0]), ([7559], [0]), ([753], [0])],
   [([3432], [0]), ([304], [0]), ([2736], [0])],
   [([311], [0]), ([279], [0]), ([10223], [0])],
   [([4048], [0]), ([10411], [0]), ([279], [0])],
   [([810], [0]), ([315], [0]), ([1648], [0])],
   [([922], [0]), ([279], [0]), ([584], [0])]
]
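
To locate where the two runs diverge, the token IDs can be compared step by step. The following is a minimal sketch, not code from vLLM: `tokens_for_prompt` and `first_divergence` are hypothetical helper names, and the nested layout (one list per decoding step, one `([token_id], [rank])` tuple per prompt) is assumed from the printed results above.

```python
from typing import List, Optional, Tuple

# One decoding step: one ([token_id], [rank]) tuple per prompt in the batch.
Step = List[Tuple[List[int], List[int]]]


def tokens_for_prompt(result: List[Step], prompt_idx: int) -> List[int]:
    """Extract the sampled token ID at every step for one prompt."""
    return [step[prompt_idx][0][0] for step in result]


def first_divergence(a: List[int], b: List[int]) -> Optional[int]:
    """Return the first step at which two token sequences differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


# Token IDs for 'The capital of France is', copied from the two runs above.
first_run = [12366, 11, 902, 374, 7559, 304, 279, 10411, 315, 279]
second_run = [12366, 13, 1102, 374, 7559, 304, 279, 10411, 315, 279]

if __name__ == "__main__":
    print("first differing step:", first_divergence(first_run, second_run))
```

For this prompt the sequences differ at steps 1 and 2 only (`11, 902` versus `13, 1102`), which matches the ', which' versus '. It' difference in the generated text.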

EthanqX avatar Jun 12 '24 21:06 EthanqX

All fields looked as expected when I stepped into the Sampler code and examined sampling_metadata. Will further investigate model output before the sampling stage.

# === Sampling Metadata when generating second output token in previous example ===

SamplingMetadata(seq_groups=[
    SequenceGroupToSample(seq_ids=[0], sampling_params=SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[128001], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), seq_data={0: SequenceData(prompt_token_ids=[128000, 9906, 11, 856, 836, 374], output_token_ids=[323], cumulative_logprob=-4.27239990234375)}, seq_len=None, query_len=None, generator=None, is_prompt=False, prompt_logprob_indices=[], sample_indices=[0]), 
    SequenceGroupToSample(seq_ids=[1], sampling_params=SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[128001], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), seq_data={1: SequenceData(prompt_token_ids=[128000, 791, 6864, 315, 9822, 374], output_token_ids=[12366], cumulative_logprob=-1.4869756698608398)}, seq_len=None, query_len=None, generator=None, is_prompt=False, prompt_logprob_indices=[], sample_indices=[1])], selected_token_indices=tensor([0, 1], device='cuda:0'), categorized_sample_indices={<SamplingType.GREEDY: 0>: tensor([[0, 0],
    [1, 1]], device='cuda:0', dtype=torch.int32), <SamplingType.RANDOM: 1>: tensor([], device='cuda:0', size=(0, 2), dtype=torch.int32), <SamplingType.RANDOM_SEED: 2>: tensor([], device='cuda:0', size=(0, 2), dtype=torch.int32), <SamplingType.BEAM: 3>: tensor([], device='cuda:0', size=(0, 2), dtype=torch.int32)}),  
Sampling results: 
 [([358], [0]), ([11], [0])] 
 
 
SamplingMetadata(seq_groups=[
    SequenceGroupToSample(seq_ids=[2], sampling_params=SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[128001], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), seq_data={2: SequenceData(prompt_token_ids=[128000, 9906, 11, 856, 836, 374], output_token_ids=[323], cumulative_logprob=-4.276355743408203)}, seq_len=None, query_len=None, generator=None, is_prompt=False, prompt_logprob_indices=[], sample_indices=[0]), 
    SequenceGroupToSample(seq_ids=[3], sampling_params=SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[128001], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), seq_data={3: SequenceData(prompt_token_ids=[128000, 791, 6864, 315, 9822, 374], output_token_ids=[12366], cumulative_logprob=-1.4816458225250244)}, seq_len=None, query_len=None, generator=None, is_prompt=False, prompt_logprob_indices=[], sample_indices=[1]), 
    SequenceGroupToSample(seq_ids=[4], sampling_params=SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[128001], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), seq_data={4: SequenceData(prompt_token_ids=[128000, 791, 3938, 315, 15592, 374], output_token_ids=[1618], cumulative_logprob=-2.299220085144043)}, seq_len=None, query_len=None, generator=None, is_prompt=False, prompt_logprob_indices=[], sample_indices=[2])], selected_token_indices=tensor([0, 1, 2], device='cuda:0'), categorized_sample_indices={<SamplingType.GREEDY: 0>: tensor([[0, 0],
    [1, 1],
    [2, 2]], device='cuda:0', dtype=torch.int32), <SamplingType.RANDOM: 1>: tensor([], device='cuda:0', size=(0, 2), dtype=torch.int32), <SamplingType.RANDOM_SEED: 2>: tensor([], device='cuda:0', size=(0, 2), dtype=torch.int32), <SamplingType.BEAM: 3>: tensor([], device='cuda:0', size=(0, 2), dtype=torch.int32)}),  

Sampling results: 
 [([358], [0]), ([13], [0]), ([11], [0])]

EthanqX avatar Jun 15 '24 03:06 EthanqX

Is there any script that I can use to reproduce this issue?

I've been looking into #5607, which appears related, but after some digging, that bug seems to be related to the presence of repetition_penalty on some requests but not others. That doesn't seem to be the case here.

tdoublep avatar Jun 18 '24 12:06 tdoublep

I think #5607 fixed a different issue. After comparing logits before and after temperature scaling, I realized the zero-temperature is erroneously reassigned to 1.0. It should be temperature = _SAMPLING_EPS instead.

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/sampling_metadata.py#L359-L363

EthanqX avatar Jun 18 '24 22:06 EthanqX

I think these lines of code are likely to be related to the problem, but whether the temperature should be set to _SAMPLING_EPS remains to be sorted out. I quickly tested this modification and found that the decoding result turned into nonsense output, unfortunately.

ShangmingCai avatar Jun 19 '24 03:06 ShangmingCai

Hello @rangehow, may I ask which model you used to reproduce this bug? I recently encountered the same inconsistent behavior when setting top_k=1 (or temperature=0) for a GPTQ-quantized model. I dug into the intermediate outputs and found that the problem lies not in the sampling_metadata but in the hidden_state: the hidden_state inputs to the logits_processor are already slightly different for identical prompts.

Yet I am not able to reproduce this bug when I am using a non-quantized fp16 (bf16) model.

ShangmingCai avatar Jul 02 '24 10:07 ShangmingCai

gemma-2b 😃

rangehow avatar Jul 02 '24 10:07 rangehow

Good morning everybody. I have also spotted an issue with temperature=0. The model appears much less confident at T=0 than at T=0.05. By confidence I mean 1/perplexity. Here is the formula I used:

MEAN_LOGPROBS = SUM(logprobs)/n_tokens
CONFIDENCE = exp(MEAN_LOGPROBS)
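
The formula can be written out as a short function. This is a minimal sketch under the assumption that `logprobs` holds the natural-log probability of each generated token; the function name `confidence` and the sample values are illustrative only, not taken from the experiment.

```python
import math
from typing import Sequence


def confidence(logprobs: Sequence[float]) -> float:
    """Inverse perplexity: exp of the mean per-token log-probability."""
    if not logprobs:
        raise ValueError("logprobs must contain at least one value")
    mean_logprob = sum(logprobs) / len(logprobs)
    return math.exp(mean_logprob)


if __name__ == "__main__":
    # Hypothetical per-token logprobs for one completion.
    print(confidence([-0.1, -0.3, -0.05]))
```

A completion in which every token has probability 1 (logprob 0) gives a confidence of exactly 1, which is the upper bound.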

I have repeated the experiment for more than 1000 samples per temperature. I am using a Mistral v02 FP8 and the OpenAI API via vllm. Here you can find a plot:

image

As you can see, the confidence decreases, but at T=0 it is at the level of T=0.6, whereas the confidence at T=0 should be the highest.

FilippoBoni1921 avatar Jul 29 '24 09:07 FilippoBoni1921

I realized I sent the wrong plot, with wrong x-axis ticks. Here is the correct one:

screen

FilippoBoni1921 avatar Jul 29 '24 10:07 FilippoBoni1921

Has this bug been fixed?

hulongan avatar Sep 23 '24 08:09 hulongan

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Dec 23 '24 02:12 github-actions[bot]

Has this bug been fixed? Any update?

haoruilee avatar Dec 30 '24 16:12 haoruilee

From https://github.com/vllm-project/vllm/issues/5898#issuecomment-2576082209 I made a patch for this issue, copied below. Note this patch only affects the sampling. If input logits differ due to non-deterministic CUDA operations, this patch won't solve that issue. See the #5898 issue.

I tested this patch on two different systems (vLLM on a Linux VM and vLLM within WSL on a Windows GPU computer) and it worked on both. The output at zero temperature is far more consistent than before (tested with various queries repeated many times). I think the remaining randomness is due to CUDA nondeterminism.

In vllm/model_executor/sampling_metadata.py delete the following (at line 413 on the GitHub version as of now):

if temperature < _SAMPLING_EPS:
    # NOTE: Zero temperature means deterministic sampling
    # (i.e., greedy sampling or beam search).
    # Set the temperature to 1 to avoid division by zero.
    temperature = 1.0

In vllm/model_executor/layers/sampler.py replace the following (at line 268 on the GitHub version as of now):

# Use float32 to apply temperature scaling.
# Use in-place division to avoid creating a new tensor.
logits = logits.to(torch.float)
logits.div_(sampling_tensors.temperatures.unsqueeze(dim=1))

with the following:

# Apply temperature scaling, special handling for zero-temperature case.
# Use float32 to apply temperature scaling in all cases.
logits = logits.to(torch.float)
temperature = sampling_tensors.temperatures.unsqueeze(dim=1)
is_zero = (temperature == 0)

# Positive temperature path.
# Need to adjust denominator to avoid division by zero causing problems.
# Any zero temperature entries are multiplied by False (0).
# This effectively means denominator adjustment never messes with things.
logits_p = (~is_zero) * logits / (temperature + is_zero)

# Zero temperature path.
# Any positive temperature entries are multiplied by False (0).
logits_z = is_zero * 1e9 * (logits == logits.max(dim=1, keepdim=True)[0])

# Final logits is sum of both cases.
# Always one of them is zero since mutually exclusive.
logits = logits_p + logits_z
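
The intent of the replacement can be illustrated without PyTorch. The following is a minimal pure-Python sketch of the same per-row logic, not the patch itself: `scale_logits` is a hypothetical name, and plain lists stand in for the batched tensors.

```python
from typing import List


def scale_logits(logits: List[List[float]],
                 temperatures: List[float]) -> List[List[float]]:
    """Per-row temperature scaling with a special case for temperature == 0."""
    scaled = []
    for row, temperature in zip(logits, temperatures):
        if temperature == 0:
            # Zero-temperature path: the maximum entries get a huge logit
            # (1e9) and everything else gets 0, so a later softmax puts
            # essentially all probability mass on the maximum.
            top = max(row)
            scaled.append([1e9 if x == top else 0.0 for x in row])
        else:
            # Positive-temperature path: ordinary division.
            scaled.append([x / temperature for x in row])
    return scaled


if __name__ == "__main__":
    batch = [[1.0, 3.0, 2.0], [1.0, 3.0, 2.0]]
    print(scale_logits(batch, [0.0, 2.0]))
```

Note that if several entries tie for the maximum, all of them receive 1e9 in this sketch, so a tie is preserved rather than broken.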

StanHatko avatar Jan 09 '25 14:01 StanHatko

In vllm/model_executor/sampling_metadata.py delete the following (at line 413 on the GitHub version as of now):

if temperature < _SAMPLING_EPS:
    # NOTE: Zero temperature means deterministic sampling
    # (i.e., greedy sampling or beam search).
    # Set the temperature to 1 to avoid division by zero.
    temperature = 1.0

or maybe can we just add top_k = 1 in this if condition?

zhc7 avatar Jan 09 '25 15:01 zhc7

Any update?

xuhaolei avatar Feb 01 '25 08:02 xuhaolei

I'm not sure if adding a top_k = 1 check in the condition would solve it, as a temperature of 1 will still cause non-deterministic behavior, unless there is some other check elsewhere. My patch solved the issue in my testing (it reduced the randomness, though some was still left, I think due to CUDA nondeterminism in the steps up to calculating the logits) and works with a mixture of zero and nonzero temperatures in the same batch. It's not the most elegant solution, but it works.

Should I make a pull request with my patch, or does someone have a better solution?

StanHatko avatar Feb 02 '25 15:02 StanHatko

I tried adding a top_k = 1 check in the condition, but it doesn't seem to solve the problem.

Did you use LoRA? For me, I found that the issue seems to come from LoRA. (#7977) My vLLM version is 0.6.3.post1. When I tested the GSM8K dataset with the LLaMA3-8B-Instruct model, I found that greedy search remained consistent, but once I added the LoRA module, the results became inconsistent.

A temporary solution I found is to merge the LoRA module into the original model, so I don’t have to use vLLM’s LoRA adapter. After doing this, the consistency was maintained.

xuhaolei avatar Feb 03 '25 07:02 xuhaolei

For me there was no LoRA, it was just with regular inference.

StanHatko avatar Feb 06 '25 02:02 StanHatko

Sorry for being late to this discussion. I think it's expected that top_k = 1 and temp = 0 could give different results and that in particular top_k = 1 with nonzero temp and no seed could be especially nondeterministic.

top_k = 1 actually means sample from all tokens with probabilities matching the highest token's probability, so if there's more than one token tied for the top place (which is not uncommon with fp16), one will be chosen randomly.
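
The difference can be illustrated with a small sketch. It assumes a threshold-style top-k filter (keep every logit greater than or equal to the k-th largest value), which is one way to obtain the behavior described above; it is not vLLM's actual sampler code, and all function names are hypothetical.

```python
import math
import random
from typing import List


def top_k_candidates(logits: List[float], k: int) -> List[int]:
    """Indices kept by a threshold-style top-k filter (ties all survive)."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [i for i, x in enumerate(logits) if x >= threshold]


def greedy(logits: List[float]) -> int:
    """Argmax; Python's max returns the first maximal index on ties."""
    return max(range(len(logits)), key=logits.__getitem__)


def sample_top_k(logits: List[float], k: int, rng: random.Random) -> int:
    """Sample among the surviving candidates, weighted by softmax."""
    candidates = top_k_candidates(logits, k)
    top = max(logits[i] for i in candidates)
    weights = [math.exp(logits[i] - top) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    tied = [1.0, 5.0, 5.0, 0.0]  # tokens 1 and 2 tie for first place
    rng = random.Random(0)
    print("greedy:", greedy(tied))
    print("top_k=1 draws:", {sample_top_k(tied, 1, rng) for _ in range(200)})
```

With the tied logits, greedy decoding always returns index 1, while the top_k=1 path may return either index 1 or index 2 from one call to the next.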

njhill avatar Feb 14 '25 23:02 njhill

However, it would produce more stable (actually the same) generations using hf transformers when setting temperature = 0

halfrot avatar Mar 16 '25 09:03 halfrot

Any updates now? I have the same problem! Maybe model.eval() is not being called, so the dropout layers are still active?

potaninmt avatar Apr 05 '25 15:04 potaninmt

For me the determinism seems a lot better when using a new vLLM (specifically 0.8.2 is what I currently use) with the V1 engine, manually setting the temperature to zero. There is still a bit of nondeterminism left (due to I think nondeterministic CUDA operations, as discussed elsewhere) but it's much better than before.

Do others still have this issue when using the new V1 engine for vLLM? If not, maybe V1 fixed this issue.

StanHatko avatar Apr 07 '25 02:04 StanHatko

@StanHatko The thing is that when the temperature is 0, according to the probability formula, the probability of the first (top-ranked) token should always be exactly 1. In any case, this is a bug.
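
The limit can be checked numerically. This is a minimal sketch of a temperature-scaled softmax in plain Python; the function name and the example logits are illustrative assumptions, and temperature 0 itself is excluded because it would divide by zero.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float],
                             temperature: float) -> List[float]:
    """Numerically stable softmax of logits / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]  # hypothetical values
    for t in (1.0, 0.1, 0.01):
        print(t, softmax_with_temperature(logits, t)[0])
```

As the temperature shrinks, the probability of the largest logit approaches 1. This holds only when the maximum is unique: two exactly tied logits each approach 0.5 instead.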

potaninmt avatar Apr 07 '25 14:04 potaninmt

@potaninmt are you using v0 or v1? This PR should hopefully help with v0: https://github.com/vllm-project/vllm/pull/13312

njhill avatar Apr 07 '25 17:04 njhill

There are two distinct causes of nondeterminism at zero temperature:

  • In the sampler, which https://github.com/vllm-project/vllm/pull/13312 should fix for V0 sampler. I don't think the sampler bug exists in V1, but if it does please post it here since in that case we need to fix it for the V1 sampler as well.
  • In the computations leading up to the logits, which can involve nondeterministic CUDA operations for various steps (like cumulative sums). See https://github.com/vllm-project/vllm/issues/2910 and https://github.com/vllm-project/vllm/issues/5898, as well as PyTorch issue https://github.com/pytorch/pytorch/issues/75240. For these even if the sampler is fixed, since the inputs to the sampler are different there will still be different outputs at temperature 0.
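
The second cause can be illustrated without a GPU. Floating-point addition is not associative, so summing the same terms in a different order can give a different result, and a small difference is enough to flip an argmax between near-tied logits. The sketch below uses deliberately exaggerated magnitudes and hypothetical names; real differences are far smaller.

```python
from typing import List


def sum_in_order(terms: List[float]) -> float:
    """Sum strictly left to right, as one fixed reduction order."""
    total = 0.0
    for t in terms:
        total += t
    return total


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


# The same three terms, accumulated in two different orders.
logit_a_order1 = sum_in_order([1e16, 1.0, -1e16])  # the 1.0 is lost to rounding
logit_a_order2 = sum_in_order([1e16, -1e16, 1.0])  # the 1.0 survives
logit_b = 0.5

if __name__ == "__main__":
    print(logit_a_order1, logit_a_order2)
    print("greedy token, order 1:", argmax([logit_a_order1, logit_b]))
    print("greedy token, order 2:", argmax([logit_a_order2, logit_b]))
```

The sampler is fully deterministic in both cases, yet the chosen token differs because its input differs.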

StanHatko avatar Apr 09 '25 03:04 StanHatko

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Jul 09 '25 02:07 github-actions[bot]

Pinging this issue to keep it unstale because many of us would still like to see it resolved.

wgantt avatar Jul 25 '25 19:07 wgantt

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Oct 25 '25 02:10 github-actions[bot]

pinging again to keep unstale

wgantt avatar Oct 25 '25 12:10 wgantt