How to set do_sample=False?
I tested batch inference with the llava and llava-next-video models using tensorrt-llm, based on examples/multimodal/run.py. Both models call generate with the same parameters, shown below; specifically, I used the defaults in generate without any modifications and set the batch size to 8. My question: for the llava model, all 8 outputs in a batch are exactly the same, but for the llava-next-video model the 8 results differ. I want the llava-next-video batch results to be exactly the same as well. How should I set the parameters to get the effect of do_sample=False in the Hugging Face transformers model.generate method?
```python
output_ids = self.model.generate(
    input_ids,
    sampling_config=None,
    prompt_table=ptuning_args[0],
    max_new_tokens=max_new_tokens,
    end_id=end_id,
    pad_id=self.tokenizer.pad_token_id
    if self.tokenizer.pad_token_id is not None else
    self.tokenizer.all_special_ids[0],
    top_k=self.args.top_k,
    top_p=self.args.top_p,
    temperature=self.args.temperature,
    repetition_penalty=self.args.repetition_penalty,
    num_beams=self.args.num_beams,
    output_sequence_lengths=False,
    return_dict=False)
```
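For reference, this is the kind of Hugging Face transformers call whose behaviour I want to reproduce (a minimal sketch using a placeholder text-only model, not my actual llava setup):

```python
# Minimal sketch of the transformers behaviour I want to reproduce.
# "gpt2" is just a placeholder text-only model, not my actual llava checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token by default
tokenizer.padding_side = "left"             # pad on the left for batched generation
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The same prompt repeated 8 times, mirroring my batch-size-8 test.
inputs = tokenizer(["the same prompt"] * 8, return_tensors="pt", padding=True)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,                    # greedy decoding: all 8 outputs are identical
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```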
Additionally, I tried passing a do_sample=False argument; inference ran without errors, but it had no effect: the results within a batch were still not identical.
I ran the batch inference loop ten times and found that, although the results within each batch varied, the corresponding results across batches were exactly the same. So where does the randomness in the tensorrt-llm generate method come from? What I want is for the results within the same batch, for the same prompt, to be completely identical.
In TensorRT-LLM, if you don't set beam_width (the default value is 1), it uses sampling. Under sampling, you can use top_k and top_p to control the randomness; if you set top_k = 1, it performs greedy search.
If you set beam_width > 1, TRT-LLM uses beam search and ignores the top_k and top_p values.
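In terms of the keyword arguments in the generate() call above, the regimes look roughly like this (illustrative values only; exact defaults may differ across TensorRT-LLM versions):

```python
# Illustrative argument combinations for the generate() call shown above.
# Values other than the ones that pick the decoding mode are arbitrary examples.

# 1) Sampling (num_beams == 1): top_k / top_p / temperature control the randomness.
sampling_kwargs = dict(num_beams=1, top_k=50, top_p=0.9, temperature=1.0)

# 2) Greedy search: still the sampling path, but top_k=1 always picks the argmax token,
#    so identical prompts give identical outputs.
greedy_kwargs = dict(num_beams=1, top_k=1)

# 3) Beam search (num_beams > 1): top_k and top_p are ignored.
beam_search_kwargs = dict(num_beams=4)
```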
So there are two ways to set the parameters:
- top_k=1;
- beam_width=1.

Either setting on its own achieves the effect of do_sample=False. Is that correct?
Not fully correct. To achieve the effect of do_sample=False, you should set top_k=1 and beam_width=1 at the same time.
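Applied to the run.py-style call from the question, that would look roughly like this (a sketch; the non-sampling arguments are carried over unchanged from your snippet, and top_p / temperature are left in place because they should no longer affect the result once top_k=1):

```python
# Sketch: deterministic (greedy) decoding for the call shown in the question.
# top_k=1 together with num_beams=1 (beam_width 1) reproduces do_sample=False.
output_ids = self.model.generate(
    input_ids,
    sampling_config=None,
    prompt_table=ptuning_args[0],
    max_new_tokens=max_new_tokens,
    end_id=end_id,
    pad_id=self.tokenizer.pad_token_id
    if self.tokenizer.pad_token_id is not None else
    self.tokenizer.all_special_ids[0],
    top_k=1,                            # always pick the highest-probability token
    top_p=self.args.top_p,              # should have no effect once top_k=1
    temperature=self.args.temperature,  # likewise should not change the argmax token
    repetition_penalty=self.args.repetition_penalty,
    num_beams=1,                        # beam_width of 1, i.e. no beam search
    output_sequence_lengths=False,
    return_dict=False)
```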