
Tritonserver Physical RAM Grows Over Time

Open · apokerce opened this issue 1 year ago · 18 comments

Description
We found that tritonserver's RAM usage keeps increasing on the machine until it reaches a point where (or when another process has a temporary need for RAM) tritonserver crashes due to OOM (Out of Memory). Memory increases constantly at a rate of 0.2578125 MB per roughly 15-second period when concurrency is 20. We also tried with --grpc-infer-allocation-pool-size set to 0.

Valgrind result after running for 1 hour:
definitely lost: 145,440 bytes in 3 blocks
indirectly lost: 0 bytes in 0 blocks
possibly lost: 37,700 bytes in 212 blocks
still reachable: 70,749,123 bytes in 36,493 blocks
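For reference, a minimal sketch of the kind of polling loop that could produce a growth-rate measurement like the one above (this helper is not part of the original report; the process name and the 15-second interval are assumptions):

# Hypothetical monitoring helper: poll the resident set size (RSS) of the
# tritonserver process every 15 seconds and append it to a CSV file.
PID=$(pgrep -o tritonserver)                      # oldest PID matching the name
LOG=rss_log.csv
echo "timestamp,rss_kb" > "$LOG"
while kill -0 "$PID" 2>/dev/null; do
    RSS_KB=$(ps -o rss= -p "$PID" | tr -d ' ')    # RSS in kilobytes
    echo "$(date +%s),$RSS_KB" >> "$LOG"
    sleep 15
done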

Triton Information
What version of Triton are you using? "nvcr.io/nvidia/tensorrt:23.12-py3"

Are you using the Triton container or did you build it yourself? Using the Triton container.

To Reproduce
config.txt
run.txt

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble, include the model configuration file for that as well). The model's config file and how Triton is configured are attached in the To Reproduce section.

Expected behavior
RAM usage should saturate at some point so that it does not cause any OOM.

apokerce avatar Jan 10 '24 08:01 apokerce

Thanks @apokerce for filing this issue and creating detailed repro instructions. I'll file a ticket for the team to investigate.

Tabrizian avatar Jan 10 '24 16:01 Tabrizian

Hi @apokerce, thanks for the repro steps, we will be looking into the memory issue. Meanwhile, could you also provide the full output from Valgrind? In our CI testing we do run Valgrind with the TRT model, and there are some memory leaks marked as "definitely lost" that are actually from other bugs. For example, the dl-open ones are due to this bug. We have this white list for memory leaks: https://github.com/triton-inference-server/server/blob/main/qa/common/check_valgrind_log.py#L32-L48. Just wanted to check if there are some other leaks in the Valgrind output you got.
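For reference, a rough sketch of how a full Valgrind log for tritonserver could be captured (this is an assumption about the setup, not the exact CI command; the model repository path is a placeholder):

# Run tritonserver under Valgrind with full leak reporting and write everything
# to a log file that can be attached to the issue.
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes \
         --log-file=valgrind-out.txt \
         tritonserver --model-repository=/models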

krishung5 avatar Jan 30 '24 01:01 krishung5

Hi @krishung5, sorry for the delay; I could not find the previously shared Valgrind output (the full output). I am attaching the Valgrind result below. We are mainly interested in the "possibly lost" part because that memory is not being released at all. Thank you.

valgrind-out-.txt

apokerce avatar Feb 06 '24 16:02 apokerce

@apokerce Thanks for providing the file. Can you also let me know what kind of hardware (GPU/device) and framework you are using? I wasn't able to repro the OOM issue using the config above, so I just wanted to double check if there's anything that I missed.

krishung5 avatar Feb 08 '24 22:02 krishung5

@krishung5 We use an A40 GPU; most of the models are TensorRT-converted, plus some small PyTorch models. The OOM issue arises after around 6 days with around 60-80 concurrent requests coming from clients (our ensemble has multiple models, of course). Our problem is that the Triton server's RAM usage never reaches a cutoff point, and eventually, when another pod in the cluster needs RAM (like a new pod starting), it causes OOM. Note that using system shared memory fills RAM more slowly than GRPC requests do. The 6 days I mentioned is the combined case, where most clients use shared memory and some use GRPC; however, when everything is on GRPC, the memory growth is faster.

apokerce avatar Feb 09 '24 05:02 apokerce

Hi @apokerce, I ran the reproducer on A40 but still couldn't observe any memory growth. I used the following script to run perf_analyzer for several iterations and obtained memory usage from free -m:

for i in $(seq 1 1000); do
    echo 'count' $i
    # show memory
    free -m >> $LOG_FILE

    perf_analyzer -m $MODEL_NAME --concurrency-range=20 --input-data=random --shape=input__0:3,640,640 -b 1 -u localhost:8001 -i grpc --measurement-interval 1000
done

In my experiments, the memory usage remained flat after some point. Here are the results from three cases:

  1. Without using the --grpc-infer-allocation-pool-size flag (default size is 8).
  2. With the --grpc-infer-allocation-pool-size flag set to 1024.
  3. With the --grpc-infer-allocation-pool-size flag set to 0.

[image: memory usage over iterations for the three cases]

From the graph, there is no memory growth observed.

Could you please try the following options to help narrow down the investigation:

  1. Run with the HTTP endpoint to see if the memory still grows - I didn't observe memory growth when using the HTTP with the reproducer.
  2. Try using tcmalloc/jemalloc - we've encountered memory issues due to the default malloc heuristics. It might be worth trying different malloc libraries to see if that resolves the issue. Please refer here for instructions on using tcmalloc/jemalloc; a minimal launch sketch is included after this list.
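As a rough illustration of option 2 (the allocator library paths and the model repository below are assumptions and depend on the container/distribution):

# Preload an alternative allocator before starting Triton.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:${LD_PRELOAD} \
    tritonserver --model-repository=/models

# Or with jemalloc:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2:${LD_PRELOAD} \
    tritonserver --model-repository=/models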

Also, could you please share the engine built with the flag trtexec --timingCacheFile=timing.cache? This flag ensures that we use the exact same engine build to run on an A40. I can rerun the experiment with the same engine to check if memory growth is reproducible.
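For example, a hypothetical build command along those lines (the ONNX file name and engine name are placeholders, not taken from this issue):

# Build the TensorRT engine while saving the timing cache so the same build can
# be reproduced on another A40.
trtexec --onnx=model.onnx \
        --saveEngine=model.plan \
        --timingCacheFile=timing.cache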

krishung5 avatar Feb 20 '24 17:02 krishung5

Hi @krishung5, thanks for the detailed answer. However, while it seems to cap at some point, it does not: the growth slows down, but eventually it continues. You can even see a small upward incline in your graph; it is obvious for the 1024 case. Even if it did cap at some point (the small increases stopped), we still could not afford that much RAM usage. We have more concurrent requests than 20 (around 80 to 100), and our GRPC clients are being closed and re-opened while the Triton server is alive. Also, we have bigger models, so this RAM usage is not affordable for our case. As I said in my previous response, the OOM issue arises in around 6 days when 60 to 80 concurrent requests are present, which is around 108 million requests over 6 days. If we could at least stop that growth, e.g. by setting a flag or by not caching this much data, without affecting latency too much, that would be great, because from our standpoint it is currently indefinite how long the Triton server will last (at the very least we want it to keep serving until we stop it).

Note that I have tried both options; with HTTP and with tcmalloc the result was the same, at some point memory usage reached an unaffordable level.

Here is the model plan file that you wanted. I tried again, and the RES memory in top for tritonserver started at 7.5 and grew to 16.3.

https://drive.google.com/file/d/1ijpmGoZAK8jUJFZETkjTXU4i6jT7yReG/view?usp=sharing

apokerce avatar Feb 21 '24 16:02 apokerce

Hi @apokerce, thank you for providing the files. I reran the test, and the results are basically the same. The memory usage for GRPC remains unchanged just like the above graph, and using the HTTP client does not result in any memory increase. Triton does not enable caching by default, so the reproducer does not use any caching. There might be some regression on the GRPC side, and I'm currently working on obtaining the memory footprint and analyzing it. Additionally, I'm collecting data with much higher concurrency and longer duration.
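(As an aside, a generic way to snapshot a process's memory footprint between runs is sketched below; this is a general Linux technique, not necessarily the tooling being used for this investigation, and the process name is an assumption.)

# Capture a per-mapping memory snapshot of the running tritonserver process;
# comparing snapshots over time shows which regions are growing.
PID=$(pgrep -o tritonserver)
pmap -x "$PID" > pmap_$(date +%s).txt
cat /proc/"$PID"/smaps_rollup          # RSS/PSS summary for the whole process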

Could you please confirm if you are still observing memory growth and eventually encountering OOM errors when every client uses HTTP to send requests? I'm curious if there is any difference in memory usage between using GRPC and HTTP. Thank you!

krishung5 avatar Mar 07 '24 15:03 krishung5

Hi @krishung5, unfortunately we are not able to switch to HTTP directly, but I will try the production workload with perf_analyzer using the HTTP client and get back to you. The last time we tried, we did observe growth, but I will re-run just in case. The memory growth unfortunately still exists in our production; as I said, it is a slow progression, and one of my observations is that sometimes it does not change for hours in production and then I see a sudden increase in the RES memory.

apokerce avatar Mar 08 '24 18:03 apokerce

Hi @krishung5, the 24.02 version of Tritonserver seems promising in terms of RAM growth. I will do more tests to see if the growth problem is resolved there, and I will report back here on whether it is solved or not.

apokerce avatar Mar 28 '24 16:03 apokerce

There is roughly 0.1 GB of growth per day with 24.02 with the flags below. perf_analyzer has been running for 5 days and there is no cut-off yet with GRPC; with HTTP I did not wait this long. The growth is steady compared to our prod version; I did not see sudden jumps.

--buffer-manager-thread-count=16 --rate-limit=off --grpc-infer-allocation-pool-size=0 --cuda-memory-pool-byte-size 0:1990656000 --pinned-memory-pool-byte-size=1073741824 --backend-config=python,shm-default-byte-size=67108864,shm-growth-byte-size=134217728

apokerce avatar Apr 03 '24 08:04 apokerce

Thanks for the info, @apokerce! I'm running the experiment with the command you provided to see if I can observe the same behavior. Bug fixes are included in the newer versions, so it's great to see that 24.02 seems promising. I was wondering if the OOM still happens with 24.02, or even 24.03? I think there may be some regression in GRPC, but it would be great if you could run the HTTP client to see if there's a difference in terms of RAM growth. I will verify this on my side in the meantime.

krishung5 avatar Apr 12 '24 18:04 krishung5

Hi @krishung5, I was not able to test 24.03. For 24.02, we will see whether we get OOM, but results will not be available soon. For 24.02 with the HTTP client, I sent requests for 2 days with the following command:

for i in $(seq 1 1); do
    perf_analyzer -m $MODEL_NAME --concurrency-range=60 --input-data=random -b 1 -u localhost:8000 -i http --measurement-interval 1000000000
done

The memory started at 2.8 and still seems to be 2.8. I guess your suspicion is correct that there is a difference between GRPC and HTTP. We will look into the feasibility of the HTTP client for our use case, but at first glance it seems to add a bit of overhead (perhaps when sending the data).

apokerce avatar Apr 22 '24 18:04 apokerce

Hi @apokerce, it would be great if you could confirm whether the OOM still happens. From my end, using the HTTP client doesn't introduce any memory growth:

[image: memory usage graph for the HTTP client]

When using the GRPC client with ToT Triton, I'm not seeing any growth either:

[image: memory usage graph for the GRPC client with ToT Triton]

Sharing my test commands for reference.

Launch Triton:

tritonserver --model-repository=models --grpc-infer-allocation-pool-size=1024 --allow-metrics=false --buffer-manager-thread-count=16 --rate-limit=off --grpc-infer-allocation-pool-size=0 --cuda-memory-pool-byte-size 0:1990656000 --pinned-memory-pool-byte-size=1073741824 --backend-config=python,shm-default-byte-size=67108864,shm-growth-byte-size=134217728

Launch perf_analyzer:

# Run perf_analyzer and gather the RAM usage to the file within the for loop
for i in $(seq 1 10000); do
    echo 'count' $i
    # show memory
    echo -e "=====iteration $i=====" >> $LOG_FILE
    free -m >> $LOG_FILE
    free -m | tr -s ' ' | cut -d ' ' -f 3 | sed -n 2p >> $LOG_FILE_MEM

    perf_analyzer -m $MODEL_NAME --concurrency-range=100 --input-data=random --shape=input__0:3,640,640 -b 1 -u localhost:8001 -i grpc --measurement-interval 10000 &
    perf_analyzer -m $MODEL_NAME --concurrency-range=100 --input-data=random --shape=input__0:3,640,640 -b 1 -u localhost:8001 -i grpc --measurement-interval 10000
done

krishung5 avatar Apr 25 '24 22:04 krishung5

Hi @krishung5, I ran the GRPC and HTTP tests for several days. I guess your test period is not the same as mine, since within a single day I also do not see any growth. Sure, I will share if we get any OOMs on 24.02.

apokerce avatar Apr 30 '24 17:04 apokerce

Any news about this?

I'm experiencing memory growth over time on my Triton server in production and wonder if there has been any progress on this. I'm working on making it easily reproducible.

sboudouk avatar Jun 11 '24 15:06 sboudouk

With the HTTP client we did not see any growth for 2 days. We are waiting on production tests to see whether we hit the OOM issue.

apokerce avatar Jun 13 '24 13:06 apokerce

Thanks, keep us updated!

sboudouk avatar Jun 13 '24 14:06 sboudouk