
Triton server memory accumulation problem

Open yoo-wonjun opened this issue 1 year ago • 6 comments


I'm working on sending inference requests to Triton via DeepStream. Comparing the GPU memory in use before an inference request is sent to the Triton server with the memory in use after the request completes, the server process keeps holding extra memory even though the request has finished.



I use Triton 23.10. After running tritonserver, nvidia-smi shows 5520 MiB of GPU memory in use. While DeepStream 6.1 is sending inference requests to tritonserver, usage rises to around 8176 MiB. When the request is completed, I check nvidia-smi and the memory is 5680 MiB, and it is still 5680 MiB when I repeat the request (confirmed with nvidia-smi). This leftover memory accumulates over time. I would appreciate it if you could tell me how to resolve this issue.

docker run --gpus device=3 -d --name st_model_convert_always --restart=always --net=host -v /home/users/asd/model:/models nvcr.io/nvidia/tritonserver:23.10-py3 tritonserver --model-repository=/models

The requests are sent from DeepStream 6.1.

Immediately after running Triton:

+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 35%   30C    P2    55W / 260W |   5520MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The Triton server is receiving and processing the request:

+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 35%   31C    P2    65W / 260W |   8176MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

After the Triton request has completed:

+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 35%   31C    P2    55W / 260W |   5680MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
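(For reference, the readings above were taken by hand from nvidia-smi. A small helper along the following lines, assuming the GPU index 1 shown above, can log memory.used around each request; it is only a monitoring sketch, not part of the deployment.)

import subprocess
import time

def gpu_memory_used_mib(gpu_index: int = 1) -> int:
    # Ask nvidia-smi for the used memory (in MiB) of a single GPU.
    out = subprocess.check_output([
        "nvidia-smi", f"--id={gpu_index}",
        "--query-gpu=memory.used", "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip())

baseline = gpu_memory_used_mib()
print(f"baseline: {baseline} MiB")
while True:
    time.sleep(5)
    used = gpu_memory_used_mib()
    print(f"memory.used: {used} MiB (delta {used - baseline:+d} MiB)")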

yoo-wonjun • Mar 08 '24 06:03

Hi @yoo-wonjun. Which backend did you use? Can you also provide the complete nvidia-smi output and check how much memory Triton itself uses? If it is Triton, can you provide detailed steps to reproduce? Thanks.

yinggeh • Mar 09 '24 01:03

Hi @yinggeh, I will answer as follows:

  1. I started the server with the Triton Server Docker image, version 23.10.
  2. Triton uses the tensorrt_plan backend.
  3. Requests are sent to the Triton server over gRPC from the DeepStream 6.1 Docker image (a standalone sketch of this request path follows this list).
  4. If a rejection occurs, the memory consumed initially keeps increasing by about 100 MB, and the GPU eventually runs out of memory.
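(If it is useful, a standalone approximation of step 3 is sketched below. The model name, input name, and input shape are placeholders, since the real requests come from the DeepStream pipeline rather than a hand-written client.)

import numpy as np
import tritonclient.grpc as grpcclient

# "my_trt_model", the input name "INPUT__0", and the 1x3x640x640 FP32 shape are
# placeholders; substitute the real values from the model's config.pbtxt.
client = grpcclient.InferenceServerClient(url="localhost:8001")
inp = grpcclient.InferInput("INPUT__0", [1, 3, 640, 640], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 3, 640, 640).astype(np.float32))

for i in range(1000):
    client.infer(model_name="my_trt_model", inputs=[inp])
    if (i + 1) % 100 == 0:
        print(f"completed {i + 1} requests")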

yoo-wonjun • Mar 12 '24 04:03

@krishung5 Do you have any idea or insight on this?

lkomali • Mar 12 '24 22:03

Hi @yoo-wonjun, regarding

When the request is completed, I check nvidia-smi and the memory is 5680 MiB, and it is still 5680 MiB when I repeat the request.

I was wondering if you mean that you are observing GPU memory consistently growing after each run? Or do you mean that after repeatedly sending requests, the GPU memory remains at 5680 MB and does not return to the initial 5520 MB? If it's the former, could you please share a minimal reproducer with us to demonstrate the memory growth? If it's the latter case, it might be due to some framework-specific memory allocation. Consistently growing memory would be concerning in this situation.

If a rejection occurs, the memory consumed initially keeps increasing by about 100 MB, and the GPU eventually runs out of memory.

Could you please clarify - do you mean that if a request is not successfully sent, the memory will increase by about 100MB each time, causing OOM eventually?

krishung5 • Mar 13 '24 15:03

@krishung5 You are correct: the GPU memory remains at 5680 MB and does not return to the initial 5520 MB even after sending requests repeatedly. To be more specific, when a request is made at 5520 MB, usage goes up to about 7600 MB (the difference is the client's memory). When the client's request completes, it drops to 5680 MB as mentioned, and it does not come back to 5520 MB unless the Triton server is shut down and restarted. As the operation is repeated, the memory grows more and more and eventually OOM occurs. If it is framework-specific as you mentioned, does that mean it may be a problem that occurs when using TensorRT?

yoo-wonjun • Mar 15 '24 02:03

@yoo-wonjun Thanks for the explanation.

If it is framework-specific as you mentioned, does that mean it may be a problem that occurs when using TensorRT?

I mean that if the memory does not go back to the initial 5520 MB but remains stable, then it might be due to some framework-specific memory allocation. I believe that's not your case, given that you observe the memory growing until OOM eventually occurs.

Can you try a newer version of Triton (the latest is 24.04) and see if the OOM still happens? We have landed some fixes for memory leaks over the past few months. If the memory growth is still observed on the newer Triton, can you share a minimal reproducer with us to investigate?
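(In case it helps when putting a reproducer together, one way to structure it is to interleave batches of requests with nvidia-smi sampling so any per-iteration growth shows up directly in the log. This is only a sketch; the server URL, model name, input layout, and GPU index are placeholders to be replaced with your setup.)

import subprocess
import numpy as np
import tritonclient.grpc as grpcclient

def gpu_memory_used_mib(gpu_index: int = 1) -> int:
    # Sample the used memory (in MiB) of one GPU via nvidia-smi.
    out = subprocess.check_output([
        "nvidia-smi", f"--id={gpu_index}",
        "--query-gpu=memory.used", "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip())

# Placeholder client setup; replace the model and input details with the real ones.
client = grpcclient.InferenceServerClient(url="localhost:8001")
inp = grpcclient.InferInput("INPUT__0", [1, 3, 640, 640], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 3, 640, 640).astype(np.float32))

baseline = gpu_memory_used_mib()
for batch in range(20):
    for _ in range(100):
        client.infer(model_name="my_trt_model", inputs=[inp])
    used = gpu_memory_used_mib()
    print(f"batch {batch}: memory.used={used} MiB (delta {used - baseline:+d} MiB)")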

krishung5 • May 08 '24 19:05

Closing due to lack of activity. Please re-open if you would like to follow up on this issue.

krishung5 • Jul 09 '24 23:07