How to set do_sample=False?
I tested batch inference with the llava and llava-next-video models using tensorrt-llm, based on examples/multimodal/run.py. Both models call generate with the same parameters, shown below; specifically, I used the defaults in generate without any modifications and set the batch size to 8. My question: for the llava model, all 8 outputs in a batch are exactly the same, but for the llava-next-video model the 8 results differ. I want the llava-next-video batch results to be exactly the same as well. How should I set the parameters to get the effect of do_sample=False in the Hugging Face transformers model.generate method?
```python
output_ids = self.model.generate(
    input_ids,
    sampling_config=None,
    prompt_table=ptuning_args[0],
    max_new_tokens=max_new_tokens,
    end_id=end_id,
    pad_id=self.tokenizer.pad_token_id
    if self.tokenizer.pad_token_id is not None else
    self.tokenizer.all_special_ids[0],
    top_k=self.args.top_k,
    top_p=self.args.top_p,
    temperature=self.args.temperature,
    repetition_penalty=self.args.repetition_penalty,
    num_beams=self.args.num_beams,
    output_sequence_lengths=False,
    return_dict=False)
```
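For reference, this is the kind of Hugging Face transformers call whose behaviour I want to reproduce (a minimal sketch using a placeholder text-only model, not my actual llava setup):

```python
# Minimal sketch of the transformers behaviour I want to reproduce.
# "gpt2" is just a placeholder text-only model, not my actual llava checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token by default
tokenizer.padding_side = "left"             # pad on the left for batched generation
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The same prompt repeated 8 times, mirroring my batch-size-8 test.
inputs = tokenizer(["the same prompt"] * 8, return_tensors="pt", padding=True)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,                    # greedy decoding: all 8 outputs are identical
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```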
Additionally, I tried passing a do_sample=False argument; inference ran without errors, but it had no effect: the results within a batch were still not identical.
I ran the batch inference loop ten times and found that, although the results within each batch varied, the corresponding results across batches were exactly the same. So where does the randomness in the tensorrt-llm generate method come from? What I want is for the results within the same batch, for the same prompt, to be completely identical.
In TensorRT-LLM, if you don't set beam_width (the default value is 1), it uses sampling. Under sampling, you can use top_k and top_p to control the randomness; if you set top_k = 1, it performs greedy search.
If you set beam_width > 1, TRT-LLM uses beam search and ignores the top_k and top_p values.
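In terms of the keyword arguments in the generate() call above, the regimes look roughly like this (illustrative values only; exact defaults may differ across TensorRT-LLM versions):

```python
# Illustrative argument combinations for the generate() call shown above.
# Values other than the ones that pick the decoding mode are arbitrary examples.

# 1) Sampling (num_beams == 1): top_k / top_p / temperature control the randomness.
sampling_kwargs = dict(num_beams=1, top_k=50, top_p=0.9, temperature=1.0)

# 2) Greedy search: still the sampling path, but top_k=1 always picks the argmax token,
#    so identical prompts give identical outputs.
greedy_kwargs = dict(num_beams=1, top_k=1)

# 3) Beam search (num_beams > 1): top_k and top_p are ignored.
beam_search_kwargs = dict(num_beams=4)
```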
So there are two ways to set the parameters:
- top_k=1;
- beam_width=1.

Either setting on its own achieves the effect of do_sample=False. Is that correct?
Not fully correct. To achieve the effect of do_sample=False, you should set top_k=1 and beam_width=1 at the same time.
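Applied to the run.py-style call from the question, that would look roughly like this (a sketch; the non-sampling arguments are carried over unchanged from your snippet, and top_p / temperature are left in place because they should no longer affect the result once top_k=1):

```python
# Sketch: deterministic (greedy) decoding for the call shown in the question.
# top_k=1 together with num_beams=1 (beam_width 1) reproduces do_sample=False.
output_ids = self.model.generate(
    input_ids,
    sampling_config=None,
    prompt_table=ptuning_args[0],
    max_new_tokens=max_new_tokens,
    end_id=end_id,
    pad_id=self.tokenizer.pad_token_id
    if self.tokenizer.pad_token_id is not None else
    self.tokenizer.all_special_ids[0],
    top_k=1,                            # always pick the highest-probability token
    top_p=self.args.top_p,              # should have no effect once top_k=1
    temperature=self.args.temperature,  # likewise should not change the argmax token
    repetition_penalty=self.args.repetition_penalty,
    num_beams=1,                        # beam_width of 1, i.e. no beam search
    output_sequence_lengths=False,
    return_dict=False)
```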