chore: better quantization calibration loop for modelopt
Contribution by Baseten.co (engine builder team: https://docs.baseten.co/performance/engine-builder-overview)
- relevant issue: https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/133
- relevant fix in ModelOPT: https://github.com/NVIDIA/TensorRT-Model-Optimizer/pull/137
TRT-LLM has too many parameters to set. The batch_size for quantization need not be one of them; it can be inferred automatically by trial and error.
As a follow-up, it would be nice to have these changes integrated into ModelOpt and used there. Additionally, ModelOpt uses a get_max_batch_size(model, max_sample_length) heuristic to determine a suitable max batch size, but none of the existing options makes the calibration loop robust against out-of-memory failures across models.
We at Baseten want to expose a single setting such as batch_size=64 for quantization and run the same workflow for any model, lowering the batch size by factors of two as needed (see the sketch below).
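A minimal sketch of the fallback loop described above, assuming a hypothetical `calibrate_fn(dataset, batch_size)` callable that wraps the ModelOpt calibration forward pass (the real entry point in ModelOpt/TRT-LLM may differ):

```python
import torch

def calibrate_with_fallback(calibrate_fn, dataset, batch_size=64, min_batch_size=1):
    """Run calibration at the requested batch size, halving it on CUDA OOM."""
    while batch_size >= min_batch_size:
        try:
            return calibrate_fn(dataset, batch_size)
        except torch.cuda.OutOfMemoryError:
            # Free cached allocations before retrying with a smaller batch.
            torch.cuda.empty_cache()
            batch_size //= 2
            print(f"Calibration hit OOM, retrying with batch_size={batch_size}")
    raise RuntimeError("Calibration failed even at the minimum batch size")
```

With this, the same quantization workflow can start at batch_size=64 for every model and degrade gracefully on smaller GPUs instead of requiring a per-model setting.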
@michaelfeil Thanks for submitting the MR. TRT-LLM has just moved to GitHub-first development to make community engagement easier. Could you rebase your MR onto the latest main and, after finishing your local validation, update this MR?
@Tracin please help review this MR.
Thanks, June
Closing since no response after https://github.com/NVIDIA/TensorRT-LLM/pull/2806#issuecomment-2746960894. Feel free to rebase and reopen if the PR is still relevant!