[Bug] Lookahead decoding is nondeterministic and wrong after the first call to runner.generate
System Info
- x86_64
- 2TB RAM
- 8xH100
- TensorRT-LLM main @ 40274aac39f2542483906d92ec3b8014faf62912
- Cuda 12.5
Who can help?
@kaiyux @byshiue
Information
- [x] The official example scripts
- [x] My own modified scripts
Tasks
- [x] An officially supported task in the
examplesfolder (such as GLUE/SQuAD, ...) - [ ] My own task or dataset (give details below)
Reproduction
Inside examples/run.py, add a for loop to the generation.
for _ in range(3): # THIS IS THE ONLY CHANGE
with torch.no_grad():
outputs = runner.generate(
batch_input_ids=decoder_input_ids
if is_enc_dec else batch_input_ids,
encoder_input_ids=encoder_input_ids if is_enc_dec else None,
encoder_input_features=encoder_input_features
if is_enc_dec else None,
encoder_output_lengths=encoder_output_lengths
if is_enc_dec else None,
max_new_tokens=args.max_output_len,
max_attention_window_size=args.max_attention_window_size,
sink_token_length=args.sink_token_length,
end_id=end_id,
pad_id=pad_id,
temperature=args.temperature,
top_k=args.top_k,
top_p=args.top_p,
num_beams=args.num_beams,
length_penalty=args.length_penalty,
early_stopping=args.early_stopping,
repetition_penalty=args.repetition_penalty,
presence_penalty=args.presence_penalty,
frequency_penalty=args.frequency_penalty,
stop_words_list=stop_words_list,
bad_words_list=bad_words_list,
output_cum_log_probs=(args.output_cum_log_probs_npy != None),
output_log_probs=(args.output_log_probs_npy != None),
random_seed=args.random_seed,
lora_uids=args.lora_task_uids,
prompt_table=args.prompt_table_path,
prompt_tasks=args.prompt_tasks,
streaming=args.streaming,
output_sequence_lengths=True,
no_repeat_ngram_size=args.no_repeat_ngram_size,
return_dict=True,
medusa_choices=args.medusa_choices,
return_all_generated_tokens=args.return_all_generated_tokens,
input_token_extra_ids=input_token_extra_ids)
torch.cuda.synchronize()
if args.streaming:
for curr_outputs in throttle_generator(outputs,
args.streaming_interval):
if runtime_rank == 0:
output_ids = curr_outputs['output_ids']
sequence_lengths = curr_outputs['sequence_lengths']
cum_log_probs = None
log_probs = None
if args.output_cum_log_probs_npy != None:
cum_log_probs = outputs['cum_log_probs']
if args.output_log_probs_npy != None:
log_probs = outputs['log_probs']
print_output(
tokenizer,
output_ids,
input_lengths,
sequence_lengths,
output_csv=args.output_csv,
output_npy=args.output_npy,
cum_log_probs=cum_log_probs,
log_probs=log_probs,
output_cum_log_probs_npy=args.output_cum_log_probs_npy,
output_log_probs_npy=args.output_log_probs_npy)
else:
if runtime_rank == 0:
output_ids = outputs['output_ids']
sequence_lengths = outputs['sequence_lengths']
context_logits = None
generation_logits = None
cum_log_probs = None
log_probs = None
if runner.gather_context_logits:
context_logits = outputs['context_logits']
if runner.gather_generation_logits:
generation_logits = outputs['generation_logits']
if args.output_cum_log_probs_npy != None:
cum_log_probs = outputs['cum_log_probs']
if args.output_log_probs_npy != None:
log_probs = outputs['log_probs']
print_output(tokenizer,
output_ids,
input_lengths,
sequence_lengths,
output_csv=args.output_csv,
output_npy=args.output_npy,
context_logits=context_logits,
generation_logits=generation_logits,
output_logits_npy=args.output_logits_npy,
cum_log_probs=cum_log_probs,
log_probs=log_probs,
output_cum_log_probs_npy=args.output_cum_log_probs_npy,
output_log_probs_npy=args.output_log_probs_npy)
python run.py \
--max_output_len=50 \
--lookahead_config='[2,2,1]' \
--tokenizer_dir=[DIR] \
--engine_dir=[DIR]
Expected behavior
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"
actual behavior
Nondeterminism and incorrect responses after first iteration.
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 11 1111111111111111111111111111111111"
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1111111111111111111111111111111111111"
additional notes
Model is Llama architecture. max_draft_len is 107. Error doesn't happen when number of verification branches is zero or window size is 1.
Thank you very much! The bug has been fixed recently, and will be released soon
Hi @tloen , the issue should be addressed after this PR, can you please try and see if that solves the problem? Feel free to let us know if there are any more questions, thanks!
Closing this issue stale, hoping it has been resolved by the PR mentioned above. If the problem persists in the latest release, please feel free to open a new one. Thank you!