TensorRT-LLM icon indicating copy to clipboard operation
TensorRT-LLM copied to clipboard

[Bug] Lookahead decoding is nondeterministic and wrong after the first call to runner.generate

Open tloen opened this issue 1 year ago • 1 comments

System Info

  • x86_64
  • 2TB RAM
  • 8xH100
  • TensorRT-LLM main @ 40274aac39f2542483906d92ec3b8014faf62912
  • Cuda 12.5

Who can help?

@kaiyux @byshiue

Information

  • [x] The official example scripts
  • [x] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Inside examples/run.py, add a for loop to the generation.

for _ in range(3): # THIS IS THE ONLY CHANGE
        with torch.no_grad():
            outputs = runner.generate(
                batch_input_ids=decoder_input_ids
                if is_enc_dec else batch_input_ids,
                encoder_input_ids=encoder_input_ids if is_enc_dec else None,
                encoder_input_features=encoder_input_features
                if is_enc_dec else None,
                encoder_output_lengths=encoder_output_lengths
                if is_enc_dec else None,
                max_new_tokens=args.max_output_len,
                max_attention_window_size=args.max_attention_window_size,
                sink_token_length=args.sink_token_length,
                end_id=end_id,
                pad_id=pad_id,
                temperature=args.temperature,
                top_k=args.top_k,
                top_p=args.top_p,
                num_beams=args.num_beams,
                length_penalty=args.length_penalty,
                early_stopping=args.early_stopping,
                repetition_penalty=args.repetition_penalty,
                presence_penalty=args.presence_penalty,
                frequency_penalty=args.frequency_penalty,
                stop_words_list=stop_words_list,
                bad_words_list=bad_words_list,
                output_cum_log_probs=(args.output_cum_log_probs_npy != None),
                output_log_probs=(args.output_log_probs_npy != None),
                random_seed=args.random_seed,
                lora_uids=args.lora_task_uids,
                prompt_table=args.prompt_table_path,
                prompt_tasks=args.prompt_tasks,
                streaming=args.streaming,
                output_sequence_lengths=True,
                no_repeat_ngram_size=args.no_repeat_ngram_size,
                return_dict=True,
                medusa_choices=args.medusa_choices,
                return_all_generated_tokens=args.return_all_generated_tokens,
                input_token_extra_ids=input_token_extra_ids)
            torch.cuda.synchronize()

        if args.streaming:
            for curr_outputs in throttle_generator(outputs,
                                                   args.streaming_interval):
                if runtime_rank == 0:
                    output_ids = curr_outputs['output_ids']
                    sequence_lengths = curr_outputs['sequence_lengths']
                    cum_log_probs = None
                    log_probs = None
                    if args.output_cum_log_probs_npy != None:
                        cum_log_probs = outputs['cum_log_probs']
                    if args.output_log_probs_npy != None:
                        log_probs = outputs['log_probs']
                    print_output(
                        tokenizer,
                        output_ids,
                        input_lengths,
                        sequence_lengths,
                        output_csv=args.output_csv,
                        output_npy=args.output_npy,
                        cum_log_probs=cum_log_probs,
                        log_probs=log_probs,
                        output_cum_log_probs_npy=args.output_cum_log_probs_npy,
                        output_log_probs_npy=args.output_log_probs_npy)
        else:
            if runtime_rank == 0:
                output_ids = outputs['output_ids']
                sequence_lengths = outputs['sequence_lengths']
                context_logits = None
                generation_logits = None
                cum_log_probs = None
                log_probs = None
                if runner.gather_context_logits:
                    context_logits = outputs['context_logits']
                if runner.gather_generation_logits:
                    generation_logits = outputs['generation_logits']
                if args.output_cum_log_probs_npy != None:
                    cum_log_probs = outputs['cum_log_probs']
                if args.output_log_probs_npy != None:
                    log_probs = outputs['log_probs']
                print_output(tokenizer,
                             output_ids,
                             input_lengths,
                             sequence_lengths,
                             output_csv=args.output_csv,
                             output_npy=args.output_npy,
                             context_logits=context_logits,
                             generation_logits=generation_logits,
                             output_logits_npy=args.output_logits_npy,
                             cum_log_probs=cum_log_probs,
                             log_probs=log_probs,
                             output_cum_log_probs_npy=args.output_cum_log_probs_npy,
                             output_log_probs_npy=args.output_log_probs_npy)
python run.py \
    --max_output_len=50 \
    --lookahead_config='[2,2,1]' \
    --tokenizer_dir=[DIR] \
    --engine_dir=[DIR]

Expected behavior

Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"

actual behavior

Nondeterminism and incorrect responses after first iteration.

Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 11 1111111111111111111111111111111111"
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1111111111111111111111111111111111111"

additional notes

Model is Llama architecture. max_draft_len is 107. Error doesn't happen when number of verification branches is zero or window size is 1.

tloen avatar Sep 27 '24 22:09 tloen

Thank you very much! The bug has been fixed recently, and will be released soon

davidmlw avatar Oct 08 '24 11:10 davidmlw

Hi @tloen , the issue should be addressed after this PR, can you please try and see if that solves the problem? Feel free to let us know if there are any more questions, thanks!

kaiyux avatar Oct 15 '24 07:10 kaiyux

Closing this issue stale, hoping it has been resolved by the PR mentioned above. If the problem persists in the latest release, please feel free to open a new one. Thank you!

karljang avatar Nov 14 '25 18:11 karljang