
Possible GPU memory leak in Triton. Not draining.

nrepesh opened this issue 2 years ago • 9 comments

Description We start the Triton model server, which loads our models with warmups. It uses around 20958MiB / 81920MiB of GPU memory once the server is stable, healthy, and ready to take requests. As the server keeps running and handling requests, its GPU memory usage grows, and after about a day it reaches 80342MiB / 81920MiB. We then start to get errors like:

/workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 2: out of memory ; GPU=1 ; hostname=5a73aae3220e ; expr=cudaMalloc((void**)&p, size);
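For anyone trying to quantify this kind of growth, a minimal monitoring sketch (assuming Triton's Prometheus metrics endpoint is exposed on its default port 8002 and that the build reports the standard GPU gauges) is to sample the memory counters periodically:

# Append Triton's GPU memory gauge and an independent nvidia-smi reading
# to a log every 60 seconds so the growth curve can be inspected later.
while true; do
  date >> gpu_mem.log
  curl -s localhost:8002/metrics | grep nv_gpu_memory_used_bytes >> gpu_mem.log
  nvidia-smi --query-gpu=index,memory.used --format=csv,noheader >> gpu_mem.log
  sleep 60
done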

Triton Information We are using Triton version 2.20 + FIL backend. Link

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well). We are using a slew of TensorFlow, XGBoost, and ONNX models with warmups and batching. Here is a sample config file we use:

name: "our_model" platform: "onnxruntime_onnx" max_batch_size: 16 input [ { name: "input" data_type: TYPE_FP32 dims: [3, -1, -1 ] } ] output [ { name: "labels" data_type: TYPE_INT64 dims: [ -1 ] }, { name: "dets" data_type: TYPE_FP32 dims: [-1,-1] } ] instance_group [ { count: 1 kind: KIND_GPU gpus: [ 1 ] } ] dynamic_batching { max_queue_delay_microseconds: 1000 } model_warmup [ {
name: "warmup 1" batch_size: 8
inputs: {
key: "input"
value: {
data_type: TYPE_FP32
dims: [3,512,512]
random_data: true
}
}
}, { name: "warmup 2" batch_size: 8
inputs: {
key: "input"
value: {
data_type: TYPE_FP32
dims: [3,512,512]
zero_data: true
}
} }, { name: "warmup 3" batch_size: 8
inputs: {
key: "input"
value: {
data_type: TYPE_FP32
dims: [3,512,512]
random_data: true
}
} } ]

Expected behavior Since we are serving many different models, we expected the Triton model server to stay at its initial GPU memory footprint rather than increase its GPU memory usage over time.

nrepesh avatar May 12 '22 15:05 nrepesh

Hi @nrepesh, can you please provide the exact command lines for docker run ... and tritonserver ...?

nv-kmcgill53 avatar May 12 '22 17:05 nv-kmcgill53

Sure! We use docker-compose for our container orchestration (sudo docker-compose up):

docker-compose.yml

version: "3.9" services: model-server: image: $OUR_IMG_PATH deploy: resources: reservations: devices: - driver: nvidia capabilities: ['gpu'] device_ids: ['2','3'] restart: "always" logging: *default-logging

Dockerfile with triton command

We are using Triton version 2.20 + FIL backend.

FROM $OUR_PATH/triton-server-with-xgboost:2.20
COPY /models /models
EXPOSE 8000
EXPOSE 8001
EXPOSE 8002
CMD tritonserver --model-repository=/models --strict-model-config=false --log-verbose=1
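For reference, a roughly equivalent plain docker run invocation for the compose service above (a sketch only; the image path is the placeholder from the compose file, the logging anchor is omitted, and the port mappings are the Triton defaults) would be:

docker run --gpus '"device=2,3"' \
    --restart always \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    $OUR_IMG_PATH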

nrepesh avatar May 12 '22 19:05 nrepesh

We are using a slew of TensorFlow, XGBoost, and ONNX models with warmups and batching

Are you able to isolate the backend that is causing this, or does it only occur when they are all used together? I don't see anything immediately wrong with your configuration or runtime parameters.

nv-kmcgill53 avatar May 13 '22 00:05 nv-kmcgill53

This seems to occur when we use them all together. Would you suggest that we isolate the backends and try to reproduce the issue to identify whether a specific backend is causing the problem?

nrepesh avatar May 13 '22 16:05 nrepesh

Would you suggest that we isolate the backends and try to reproduce the issue to identify whether a specific backend is causing the problem?

If possible, this would help us out a lot to narrow down the scope of investigation.
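One way to run that isolation (a sketch; the repository path, model name, and perf_analyzer settings below are placeholders) is to start Triton with only one backend's models loaded at a time and drive sustained traffic at it while watching GPU memory:

# Repository containing only the ONNX models (repeat with TF-only and
# FIL/XGBoost-only repositories), then generate steady load against one
# model and check whether GPU memory keeps climbing between runs.
tritonserver --model-repository=/models_onnx_only --strict-model-config=false --log-verbose=1 &
perf_analyzer -m our_model -u localhost:8001 -i grpc --concurrency-range 4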

nv-kmcgill53 avatar May 13 '22 16:05 nv-kmcgill53

Hi @nrepesh,

Re: Tensorflow backend, it is a known limitation that TensorFlow does not release any memory it allocates until the backend is completely unloaded. There is a FAQ on TF backend specifically here: https://github.com/triton-inference-server/tensorflow_backend#how-does-the-tensorflow-backend-manage-gpu-memory

And in general, the backend won't be unloaded until the triton process exits: https://github.com/triton-inference-server/backend/blob/9666d8b5793d69f3c70b274890d97ae1f95ab45a/README.md#backend-lifecycles

I believe some other backends such as ONNX Runtime may also have this issue where they don't expose APIs to release their GPU memory pools, but @GuanLuo to confirm. I'm not sure if we have a comprehensive list somewhere.
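If the TensorFlow backend is the one holding on to memory, one mitigation worth trying (a sketch, not something verified in this thread; check the flag names against the TF backend README above for your Triton version) is to cap how much GPU memory TensorFlow is allowed to take via Triton's backend config flags:

# Cap the TensorFlow backend's per-GPU memory pool to a fraction of the
# card instead of letting it grow to fill it; adjust the fraction for
# your models. This does not affect the ONNX Runtime or FIL backends.
tritonserver --model-repository=/models --strict-model-config=false \
    --backend-config=tensorflow,gpu-memory-fraction=0.4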

rmccorm4 avatar May 16 '22 19:05 rmccorm4

Thank you for your reply. We have two models on the TF backend. We will convert them to ONNX and run an ONNX-only isolation test. Do you suppose that if all of our models use the ONNX backend, we might not hit this memory leak issue?
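For the conversion itself, a minimal sketch using the tf2onnx converter (assuming the TF models are available as SavedModels; the paths and opset below are placeholders):

pip install tf2onnx
python -m tf2onnx.convert --saved-model ./our_tf_model/saved_model --output our_tf_model.onnx --opset 13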

nrepesh avatar May 18 '22 14:05 nrepesh

I encountered a similar case. I have two ONNX models.


The error info is:


[StatusCode.INTERNAL] onnx runtime error 6: Non-zero status code returned while running FusedConv node. Name:'Conv_60_Relu_61' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 2: out of memory ; GPU=0 ; hostname=21789e0f2300 ; expr=cudaMalloc((void**)&p, size);

This makes it impossible for me to use triton server in the production environment.

Leelaobai avatar Sep 22 '22 08:09 Leelaobai

Hi @Leelaobai ,

Is this a memory leak over time as new requests come in? Or do you just not have enough GPU memory for having both models loaded and inferencing at the same time? 8GB is quite low for many models, especially for multiple models and model instances.

rmccorm4 avatar Sep 22 '22 18:09 rmccorm4

Hi @rmccorm4, even after not sending any requests for a day, the Triton server's GPU memory usage still has not decreased (it stays around 3G). Therefore, I believe at least one stage of the code is not releasing GPU memory.

Leelaobai avatar Sep 27 '22 03:09 Leelaobai

Closing this issue due to lack of activity. Please re-open it if you would like to follow up.

jbkyang-nvi avatar Nov 22 '22 03:11 jbkyang-nvi