The model inference time is inconsistent
Description
I use an ensemble model with the python_backend. After 10 warmup runs with random data, the first inference on a given piece of data is especially slow; only the second inference on the same data reaches the expected time.
Triton Information
What version of Triton are you using? 24.05-py3
Are you using the Triton container or did you build it yourself? I installed the necessary Python libraries and then built it myself.
To Reproduce
Steps to reproduce the behavior:
- Start command: `tritonserver --model-repository=/models --cuda-memory-pool-byte-size=0:1024000000`
- Warmup: `inference(np.random.uniform(size=size))`, repeated 10 times with random data.
- Run: I use 10 pieces of data and iterate over them in random order: `for da in np.random.choice(data, len(data), replace=False): inference(da)` (see the client sketch after this list).
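For concreteness, here is a minimal sketch of the client-side flow described in the steps above (warmup followed by timed inference). The model name `ensemble_model`, the tensor names `INPUT__0`/`OUTPUT__0`, and the shape are placeholders; the real names, shape, and data in test/1.py differ:

```python
import time

import numpy as np
import tritonclient.http as httpclient

# Hypothetical names and shape -- substitute the real ones from test/1.py.
MODEL_NAME = "ensemble_model"
INPUT_NAME = "INPUT__0"
OUTPUT_NAME = "OUTPUT__0"
SHAPE = [1, 3, 224, 224]

client = httpclient.InferenceServerClient(url="localhost:8000")


def inference(array: np.ndarray) -> np.ndarray:
    """Send one synchronous HTTP inference request and return the output tensor."""
    inp = httpclient.InferInput(INPUT_NAME, list(array.shape), "FP32")
    inp.set_data_from_numpy(array)
    out = httpclient.InferRequestedOutput(OUTPUT_NAME)
    result = client.infer(MODEL_NAME, inputs=[inp], outputs=[out])
    return result.as_numpy(OUTPUT_NAME)


# 10 warmup requests with random data, as described above.
for _ in range(10):
    inference(np.random.uniform(size=SHAPE).astype(np.float32))

# 10 pieces of real data (random stand-ins here), visited in random order;
# each request is timed so the slow "first hit" per sample is visible.
data = [np.random.uniform(size=SHAPE).astype(np.float32) for _ in range(10)]
for idx in np.random.permutation(len(data)):
    start = time.time()
    inference(data[idx])
    print(f"sample {idx}: {time.time() - start:.3f}s")
```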
Expected behavior
When the Triton server is freshly started, the first inference on each piece of data takes a long time. After that first inference, running the same 10 pieces of data again is fast for every one of them. For example, the first inference on a new piece of data takes about 20 seconds, but the second takes only about 0.1 s. I am using a V100.
Model and test code
The problem above is reproduced with test/1.py. The deployment server is in the cloud, with a Tesla V100-SXM2-16GB. I also noticed the following: when I run test/2.py, the first inference turns out to be the fastest.
Does the ensemble model use HTTP internally between its steps? Could caching be the cause of the slow first inference? How should I configure the server to avoid it?
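For reference, my understanding is that Triton also supports declaring warmup requests in the model configuration itself via the `model_warmup` field, which the server runs before marking the model ready. Below is a sketch for a hypothetical FP32 input named `INPUT__0`; the input name, dims, and count are placeholders rather than my real config, and I am not sure whether this would help with the per-data slow first inference described above:

```
# config.pbtxt (excerpt) -- warmup requests executed at model load time
model_warmup [
  {
    name: "random_warmup"
    batch_size: 1
    count: 10
    inputs: {
      key: "INPUT__0"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        random_data: true
      }
    }
  }
]
```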