chore: better quantization calibration loop for modelopt
Contribution by Baseten.co (engine builder team: https://docs.baseten.co/performance/engine-builder-overview)
- relevant issue: https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/133
- relevant fix in ModelOPT: https://github.com/NVIDIA/TensorRT-Model-Optimizer/pull/137
TRT-LLM has too many parameters to set. The batch_size for quantization need not be one of them; it can be inferred automatically by trial and error.
As a follow-up, it would be nice to have these changes integrated into ModelOpt and used there. Additionally, ModelOpt uses a get_max_batch_size(model, max_sample_length) heuristic to determine a suitable max batch size, but none of the existing options makes the calibration loop robust against out-of-memory failures across models.
We at Baseten want to expose a single setting such as batch_size=64 for quantization and run the same workflow for any model, lowering the batch size by factors of two as needed (see the sketch below).
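A minimal sketch of the fallback loop described above, assuming a hypothetical `calibrate_fn(dataset, batch_size)` callable that wraps the ModelOpt calibration forward pass (the real entry point in ModelOpt/TRT-LLM may differ):

```python
import torch

def calibrate_with_fallback(calibrate_fn, dataset, batch_size=64, min_batch_size=1):
    """Run calibration at the requested batch size, halving it on CUDA OOM."""
    while batch_size >= min_batch_size:
        try:
            return calibrate_fn(dataset, batch_size)
        except torch.cuda.OutOfMemoryError:
            # Free cached allocations before retrying with a smaller batch.
            torch.cuda.empty_cache()
            batch_size //= 2
            print(f"Calibration hit OOM, retrying with batch_size={batch_size}")
    raise RuntimeError("Calibration failed even at the minimum batch size")
```

With this, the same quantization workflow can start at batch_size=64 for every model and degrade gracefully on smaller GPUs instead of requiring a per-model setting.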
@michaelfeil Thanks for submitting the MR. TRT-LLM has just moved to GitHub-first development to make community engagement easier. Could you rebase your MR onto the latest main and, after finishing your local validation, update this MR?
@Tracin please help review this MR.
Thanks, June
Closing since no response after https://github.com/NVIDIA/TensorRT-LLM/pull/2806#issuecomment-2746960894. Feel free to rebase and reopen if the PR is still relevant!