ONNXRuntime TensorRT cache gets regenerated every time a model is uploaded even with correct settings
Description
Using the onnxruntime backend with the TensorRT accelerator and engine caching enabled, loading a model via load_model causes the TensorRT engine cache to be regenerated on every load.
Triton Information
Triton container nvcr.io/nvidia/tritonserver:22.06-py3
To Reproduce
Config file:
{
  "name": "model-name",
  "platform": "onnxruntime_onnx",
  "optimization": {
    "execution_accelerators": {
      "gpu_execution_accelerator": [
        {
          "name": "tensorrt",
          "parameters": {
            "trt_engine_cache_path": "/root/.cache/triton-tensorrt",
            "trt_engine_cache_enable": "true",
            "precision_mode": "FP16"
          }
        }
      ]
    }
  }
}
Use any ONNX file and call:
triton_client.load_model(model_name, config=model_config_json, files={"file:1/model.onnx": onnx_model_binary})
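For completeness, a minimal self-contained reproduction sketch (assumptions: a Triton server started with --model-control-mode=explicit and reachable at localhost:8000, and a local model.onnx file; the model name and cache path are just the ones from the config above):

import json
import tritonclient.http as httpclient

model_name = "model-name"
model_config_json = json.dumps({
    "name": model_name,
    "platform": "onnxruntime_onnx",
    "optimization": {
        "execution_accelerators": {
            "gpu_execution_accelerator": [{
                "name": "tensorrt",
                "parameters": {
                    "trt_engine_cache_path": "/root/.cache/triton-tensorrt",
                    "trt_engine_cache_enable": "true",
                    "precision_mode": "FP16",
                },
            }]
        }
    },
})

with open("model.onnx", "rb") as f:
    onnx_model_binary = f.read()

triton_client = httpclient.InferenceServerClient(url="localhost:8000")

# Each load triggers a full TensorRT engine build; the cache written to
# trt_engine_cache_path by the previous load is never reused.
triton_client.load_model(
    model_name,
    config=model_config_json,
    files={"file:1/model.onnx": onnx_model_binary},
)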
Expected behavior
It should generate the engine cache only once.
The problem comes from https://github.com/triton-inference-server/core/blob/bb9756f2012b3b15bf8d7a9e1e2afd62a7e603b5/src/model_repository_manager.cc#L108, where Triton stages the uploaded model in a temporary folder with a random name; the TRT engine cache uses the model path as part of the cache, so the cache never matches and the engine is rebuilt on every load.
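To make the failure mode concrete, an illustration only (this is not the actual hashing code in ONNX Runtime or Triton): if the cache identity incorporates the model path, and the model is staged in a freshly created random directory on every load, the identity never repeats:

import hashlib
import tempfile

def path_based_cache_key(model_path: str) -> str:
    # Illustrative stand-in for a cache identity that depends on the model path.
    return hashlib.sha1(model_path.encode()).hexdigest()[:16]

# Triton stages the overridden model in a randomly named temporary directory
# on every load_model call, so a path-derived identity differs each time and
# the previously built engine is never picked up.
for _ in range(2):
    staged_dir = tempfile.mkdtemp(prefix="triton-model-")
    print(path_based_cache_key(staged_dir + "/1/model.onnx"))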
Hi @fran6co,
Thanks for reporting the issue and doing some initial investigation.
@GuanLuo what do you think, is this related to your recent override changes?
This also happens when using models from cloud storage like S3.
There are 3 possible solutions:
- change how the TensorRT cache path is generated (this needs a change in onnxruntime)
- create temporary paths with consistent names when dealing with cloud-stored or overridden models
- change the Triton onnxruntime backend to pass the model as binary instead of a path, which produces consistent TensorRT caches: https://github.com/triton-inference-server/onnxruntime_backend/pull/126 (see the sketch after this list)
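As a rough sketch of why the third option avoids the problem (not the backend's actual implementation, just the idea behind it): an identity derived from the model bytes is independent of wherever Triton stages the file, so a cached engine can be matched across loads:

import hashlib

def content_based_cache_key(model_bytes: bytes) -> str:
    # Illustrative: an identity derived from the model content does not change
    # when the staging directory changes, so a cached engine can be reused.
    return hashlib.sha1(model_bytes).hexdigest()[:16]

with open("model.onnx", "rb") as f:  # hypothetical local model file
    onnx_model_binary = f.read()

# Same bytes -> same identity on every load, regardless of the staging path.
print(content_based_cache_key(onnx_model_binary))
print(content_based_cache_key(onnx_model_binary))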
This would be very helpful to speed up development and reduce our system's start time.
Filed DLIS-3954 to look into this.
Any news on this topic? I still face the same issue.