Possible GPU memory leak in Triton. Not draining.
Description
We start the Triton model server, which loads our models with warmups. Once the server is stable, healthy, and ready to take requests, it uses around 20958MiB / 81920MiB of GPU memory. As the server keeps handling requests, GPU memory usage climbs over the course of a day until it reaches 80342MiB / 81920MiB. We then start to get errors like:
/workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 2: out of memory ; GPU=1 ; hostname=5a73aae3220e ; expr=cudaMalloc((void**)&p, size);
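(A simple way to capture this growth pattern, in case it helps with reproduction: poll nvidia-smi and log both overall and per-process GPU memory usage; the interval and file names below are arbitrary.)
# Log overall GPU memory usage every 60 seconds
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total --format=csv -l 60 >> gpu_mem.csv &
# Log per-process GPU memory usage on the same interval
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv -l 60 >> gpu_procs.csv &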
Triton Information
We are using Triton version 2.20 + the FIL backend.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
We are using a slew of TensorFlow, XGBoost and ONNX models with warmups and batching. Here is a sample config file we use:
name: "our_model" platform: "onnxruntime_onnx" max_batch_size: 16 input [ { name: "input" data_type: TYPE_FP32 dims: [3, -1, -1 ] } ] output [ { name: "labels" data_type: TYPE_INT64 dims: [ -1 ] }, { name: "dets" data_type: TYPE_FP32 dims: [-1,-1] } ] instance_group [ { count: 1 kind: KIND_GPU gpus: [ 1 ] } ] dynamic_batching { max_queue_delay_microseconds: 1000 } model_warmup [ {
name: "warmup 1" batch_size: 8
inputs: {
key: "input"
value: {
data_type: TYPE_FP32
dims: [3,512,512]
random_data: true
}
}
}, { name: "warmup 2" batch_size: 8
inputs: {
key: "input"
value: {
data_type: TYPE_FP32
dims: [3,512,512]
zero_data: true
}
} }, { name: "warmup 3" batch_size: 8
inputs: {
key: "input"
value: {
data_type: TYPE_FP32
dims: [3,512,512]
random_data: true
}
} } ]
Expected behavior
Since we serve many different models, we expected the Triton model server to stay at its initial GPU memory usage rather than increasing it as time went on.
Hi @nrepesh, can you please provide the exact command lines for docker run ... and tritonserver ...?
Sure! We use docker-compose for our container orchestration (sudo docker-compose up):
docker-compose.yml
version: "3.9" services: model-server: image: $OUR_IMG_PATH deploy: resources: reservations: devices: - driver: nvidia capabilities: ['gpu'] device_ids: ['2','3'] restart: "always" logging: *default-logging
Dockerfile with triton command
We are using Triton version 2.20 + FIL backend.
FROM $OUR_PATH/triton-server-with-xgboost:2.20
COPY /models /models
EXPOSE 8000
EXPOSE 8001
EXPOSE 8002
CMD tritonserver --model-repository=/models --strict-model-config=false --log-verbose=1
We are using a slew of TensorFlow, XGBoost and ONNX models with warmups and batching
Are you able to isolate the backend that causes this, or does it only occur when they are all used together? I don't see anything immediately wrong with your configuration or runtime parameters.
This seems to occur when we use them all together. Would you suggest we isolate the backends and try to reproduce the issue to identify whether an individual backend is causing the problem?
Would you suggest we isolate the backends and try to reproduce the issue to identify whether an individual backend is causing the problem?
If possible, this would help us out a lot to narrow down the scope of investigation.
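One way to run that isolation without rebuilding the image would be Triton's explicit model-control mode, loading only one backend's models per run; a sketch (the model name comes from the sample config above, substitute your own, and repeat --load-model for each model under test):
# Start Triton with explicit model control and load only the models you want to test in isolation
tritonserver --model-repository=/models \
    --model-control-mode=explicit \
    --load-model=our_model \
    --log-verbose=1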
Hi @nrepesh,
Re: the TensorFlow backend, it is a known limitation that TensorFlow does not release any memory it allocates until the backend is completely unloaded. There is an FAQ on the TF backend specifically here: https://github.com/triton-inference-server/tensorflow_backend#how-does-the-tensorflow-backend-manage-gpu-memory
And in general, the backend won't be unloaded until the triton process exits: https://github.com/triton-inference-server/backend/blob/9666d8b5793d69f3c70b274890d97ae1f95ab45a/README.md#backend-lifecycles
I believe some other backends such as ONNX Runtime may also have this issue where they don't expose APIs to release their GPU memory pools, but @GuanLuo to confirm. I'm not sure if we have a comprehensive list somewhere.
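If the TensorFlow models turn out to be the main consumers, the FAQ linked above also covers capping TensorFlow's allocation at startup through the backend config; a sketch (the fraction is arbitrary, and the exact keys are documented in the tensorflow_backend README):
# Limit the TensorFlow backend to a fixed fraction of GPU memory instead of letting it grow unbounded
tritonserver --model-repository=/models \
    --backend-config=tensorflow,gpu-memory-fraction=0.3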
Thank you for your reply. We have two models on the TF backend. We will convert them to the ONNX backend and run an isolated ONNX test. Do you think that if all of our models run on the ONNX backend we might avoid this memory leak?
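(In case it's useful for that conversion: tf2onnx can usually export a TensorFlow SavedModel directly; the paths and opset below are placeholders.)
# Convert a TensorFlow SavedModel to ONNX (paths and opset are placeholders)
pip install -U tf2onnx
python -m tf2onnx.convert --saved-model /path/to/saved_model_dir --opset 13 --output model.onnx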
I encountered a similar case. I have two ONNX models.
The error info is:
[StatusCode.INTERNAL] onnx runtime error 6: Non-zero status code returned while running FusedConv node. Name:'Conv_60_Relu_61' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 2: out of memory ; GPU=0 ; hostname=21789e0f2300 ; expr=cudaMalloc((void**)&p, size);
This makes it impossible for me to use triton server in the production environment.
Hi @Leelaobai ,
Is this a memory leak over time as new requests come in? Or do you simply not have enough GPU memory to have both models loaded and running inference at the same time? 8GB is quite low for many models, especially for multiple models and model instances.
Hi @rmccorm4, even after sending no requests for a day, the GPU memory usage of the Triton server still hasn't decreased (about 3 GB). Therefore, I believe at least one stage of the code is not releasing GPU memory.
Closing due to lack of activity. Please re-open if you would like to follow up on this issue.