
chore: better quantization calibration loop for modelopt

michaelfeil opened this issue 10 months ago · 1 comment

Contribution by Baseten.co (engine builder team: https://docs.baseten.co/performance/engine-builder-overview)

  • relevant issue: https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/133
  • relevant fix in ModelOPT: https://github.com/NVIDIA/TensorRT-Model-Optimizer/pull/137

TRT-LLM has too many parameters to set. The batch_size for quantization needn't be one of them; it can be inferred automatically by trial and error.

As a follow-up, it would be nice to have these changes integrated into ModelOpt and used there. Additionally, ModelOpt uses a get_max_batch_size(model, max_sample_length) helper to determine a suitable max batch size. None of the options will make this

We at Baseten want to expose e.g. batch_size=64 for quantization and run the same workflow for any model, halving the batch size as needed when calibration runs out of memory (see the sketch below).
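A minimal sketch of the trial-and-error batch-size selection described above. The names `calibrate_with_fallback` and `run_calibration` are hypothetical placeholders (not ModelOpt or TRT-LLM APIs); the point is the halve-on-OOM retry loop.

```python
# Sketch only: retry calibration at progressively smaller batch sizes on CUDA OOM.
# `run_calibration` stands in for whatever forward-pass calibration loop is used.
import gc

import torch


def calibrate_with_fallback(run_calibration, calib_dataset, batch_size: int = 64) -> int:
    """Try calibration at `batch_size`; halve it on CUDA OOM until it fits."""
    while batch_size >= 1:
        try:
            run_calibration(calib_dataset, batch_size=batch_size)
            return batch_size  # succeeded at this batch size
        except torch.cuda.OutOfMemoryError:
            # Release partially allocated activations before retrying.
            gc.collect()
            torch.cuda.empty_cache()
            print(f"OOM at batch_size={batch_size}, retrying with {batch_size // 2}")
            batch_size //= 2
    raise RuntimeError("Calibration runs out of memory even at batch_size=1")
```

With a loop like this, a single default such as batch_size=64 works across model sizes: large models simply fall back to smaller batches instead of requiring a per-model setting.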

michaelfeil commented on Feb 20, 2025

@michaelfeil Thanks for submitting the MR. TRT-LLM has just moved to a GitHub-first model to make community engagement easier. Can you rebase your MR on the latest main and, after finishing your local validation, update this MR?

@Tracin please help review this MR.

Thanks,
June

juney-nvidia commented on Mar 24, 2025

Closing since no response after https://github.com/NVIDIA/TensorRT-LLM/pull/2806#issuecomment-2746960894. Feel free to rebase and reopen if the PR is still relevant!

poweiw commented on May 28, 2025