
Enable "trt_build_heuristics_enable" optimization for onnxruntime-TensorRT

Open tobaiMS opened this issue 1 year ago • 2 comments

OnnxRuntime has support for trt_build_heuristics_enable with the TensorRT optimization. We observed that some inference requests take an extremely long time when the user traffic changes. Without the TensorRT optimization, we set the default onnxruntime with { key: "cudnn_conv_algo_search" value: { string_value: "1" } } to enable heuristic search. However, when moving to TensorRT this setting is ignored. ORT provides an alternative setting for TRT, "trt_build_heuristics_enable" (https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#configurations), which we would like to try with Triton, but it is not supported in the Triton model config.
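A minimal sketch of the two setups described above, as config.pbtxt fragments. The cudnn_conv_algo_search and trt_build_heuristics_enable entries are quoted from this issue; the surrounding optimization block and the precision_mode value are assumptions based on the backend README, so adjust them to the actual model config:

```
# Default ORT (no TensorRT): heuristic cuDNN conv algo search, as quoted above.
parameters { key: "cudnn_conv_algo_search" value: { string_value: "1" } }

# With the TensorRT execution accelerator the setting above is ignored.
# The ORT TRT option we would like to pass is shown commented out, since the
# backend does not currently accept it as a parameter here.
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      # parameters { key: "trt_build_heuristics_enable" value: "true" }
    } ]
  }
}
```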

tobaiMS avatar Feb 23 '24 22:02 tobaiMS

I would rather recommend enabling the timing cache. That will accelerate engine builds drastically. An engine cache will further help by not rebuilding the engine each time the same model is requested.
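Roughly what that recommendation could look like in config.pbtxt, as a sketch only: the engine cache parameter names match the backend README, while the timing cache names are taken from the ORT TRT EP documentation and are commented out because it is unclear whether the Triton backend forwards them (see the next comment):

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      # Engine cache: reuse a previously built engine instead of rebuilding it
      # for the same model (listed as supported in the backend README).
      parameters { key: "trt_engine_cache_enable" value: "true" }
      parameters { key: "trt_engine_cache_path" value: "/tmp" }
      # Timing cache: reuse kernel timing measurements across engine builds.
      # Option names come from the ORT TRT EP docs; whether the Triton backend
      # accepts them here is the open question in this thread.
      # parameters { key: "trt_timing_cache_enable" value: "true" }
      # parameters { key: "trt_timing_cache_path" value: "/tmp" }
    } ]
  }
}
```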

Regarding "when the user traffic change": what exactly do you mean by that? Dynamic shapes or different models?

gedoensmax avatar Feb 26 '24 17:02 gedoensmax

Hi @gedoensmax, thanks for the reply. Currently I have already enabled parameters { key: "trt_engine_cache_enable" value: "True" } and parameters { key: "trt_engine_cache_path" value: "/tmp" }. As for "trt_timing_cache_path", it also seems to be unsupported in the Triton ORT TRT configuration: https://github.com/triton-inference-server/onnxruntime_backend?tab=readme-ov-file#onnx-runtime-with-tensorrt-optimization

tobaiMS avatar Feb 26 '24 21:02 tobaiMS