
RAM usage of Triton server grows until it is killed by the OS

InfiniteLife opened this issue on Mar 26, 2024 · 4 comments

I'm using the nvcr.io/nvidia/tritonserver:23.10-py3 container for my inferencing, via the C++ gRPC API. There are several models in the container: a YOLOv8-like architecture in TensorRT plus a few TorchScript models. During inference I notice linear growth in the Triton server's RAM consumption, starting at 12-15 GB and rising to about 80 GB after 12 hours of constant inferencing, and growing further still until, it seems, the process is killed by the OS OOM killer (Ubuntu 22.04).

The Triton Server model load mode is the default (not explicit), no shared memory is used in the inference API, and both sync and async calls are made.

My question is: what is a good way to debug this?
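One simple way to confirm and quantify the growth is to log the tritonserver process's resident set size (RSS) over time while varying which models receive traffic. Below is a minimal sketch that reads VmRSS from the Linux /proc filesystem, assuming the server's PID is passed as the only argument:

import sys
import time

def rss_mb(pid: int) -> float:
    # VmRSS in /proc/<pid>/status is reported in kB by the Linux kernel.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0
    raise RuntimeError("VmRSS not found; is this a Linux PID?")

if __name__ == "__main__":
    pid = int(sys.argv[1])   # PID of the tritonserver process
    while True:
        print(f"{time.strftime('%H:%M:%S')}  RSS = {rss_mb(pid):.1f} MiB", flush=True)
        time.sleep(60)       # one sample per minute is enough to see a linear trend

Correlating the slope of this log with which models are being exercised should help narrow down where the growth comes from.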

InfiniteLife avatar Mar 26 '24 04:03 InfiniteLife

@InfiniteLife Is it possible to run the TensorRT and PyTorch models separately? If the issue goes away for one of them, that will help narrow the problem down to a single backend.
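One way to exercise a single backend at a time, without changing the deployment, is a small single-model load loop. A rough sketch using the Python gRPC client (tritonclient); the model name, input name, shape, and datatype below are placeholders and must be replaced with the real model's values:

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Placeholder model/tensor details -- substitute your actual model config values.
MODEL_NAME = "yolov8_trt"          # hypothetical: target one model (one backend) at a time
INPUT_NAME = "images"              # hypothetical input tensor name
INPUT_SHAPE = [1, 3, 640, 640]     # hypothetical input shape
dummy = np.random.rand(*INPUT_SHAPE).astype(np.float32)

inp = grpcclient.InferInput(INPUT_NAME, INPUT_SHAPE, "FP32")
inp.set_data_from_numpy(dummy)

# Hammer a single model for a while and watch the server's RSS in parallel.
for _ in range(100000):
    client.infer(model_name=MODEL_NAME, inputs=[inp])

Running this against only the TensorRT model, then against only one TorchScript model, while logging RSS as sketched above, should show which backend's requests correlate with the growth.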

pvijayakrish avatar Mar 27 '24 00:03 pvijayakrish

My last experiment, with the following parameter added to the config of all TorchScript models:

parameters: {
  key: "ENABLE_CACHE_CLEANING"
  value: {
    string_value: "true"
  }
}

showed no memory growth. It seems like clearing the cache helps. PyTorch is weird.
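As I understand it (an assumption based on the parameter name and the PyTorch backend documentation), ENABLE_CACHE_CLEANING makes the backend clean the CUDA cache after every model execution, i.e. it releases cached, currently-unused allocator blocks. In Python terms that is roughly:

import torch

# Rough Python-level equivalent of what ENABLE_CACHE_CLEANING is understood to
# trigger after each execution (the backend does this in C++): return cached,
# currently-unused allocator blocks to the driver.
if torch.cuda.is_available():
    torch.cuda.empty_cache()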

InfiniteLife avatar Mar 27 '24 04:03 InfiniteLife

@InfiniteLife Could you please share the steps to reproduce the issue?

pvijayakrish avatar Mar 27 '24 20:03 pvijayakrish

I would need to share the models for that, but I cannot. As I mentioned, though, the issue is gone once cache cleaning is enabled for the PyTorch models.

InfiniteLife avatar Apr 06 '24 09:04 InfiniteLife