[Bug]: Suspicious memory leak in process "/usr/bin/python -R -m mpi4py.futures.server"

Open harryjing opened this issue 1 month ago • 7 comments

System Info

  • GPU: H20 96G * 8
  • Environment: TensorRT-LLM 1.2.0rc1 and 1.0.0rc4; both versions show the memory leak, so it does not appear to be tied to a specific version.
  • Model: Qwen series; we tested Qwen3-8B, Qwen3-32B, and QwQ-32B, so it is not tied to a specific model either.

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Keep sending requests and watch the host memory usage; it grows continuously. For example, we tested Qwen3-8B by serially sending the request "Write a short essay of 1000 words." in a loop (a minimal reproduction sketch follows the screenshots below). Memory usage when the test started:

Image

After 12 hours:

Image
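For reference, here is the loop as a minimal sketch (not the exact script we used). It assumes the trtllm-serve instance described later in this thread, i.e. an OpenAI-compatible /v1/completions endpoint on 127.0.0.1:9123 with a served model name of Qwen3-8B; the URL, model name, and token limit are placeholders to adapt to your deployment. It also sums the RSS of the `/usr/bin/python -m mpi4py.futures.server` workers named in the title.

```python
# Minimal reproduction/monitoring sketch (assumptions noted above).
# Requires: pip install requests psutil
import time

import psutil
import requests

URL = "http://127.0.0.1:9123/v1/completions"           # assumed endpoint
PAYLOAD = {
    "model": "Qwen3-8B",                                # placeholder model name
    "prompt": "Write a short essay of 1000 words.",
    "max_tokens": 1024,                                 # placeholder limit
}

def mpi_worker_rss_bytes() -> int:
    """Sum RSS of all `python -m mpi4py.futures.server` worker processes."""
    total = 0
    for p in psutil.process_iter(["cmdline", "memory_info"]):
        cmdline = " ".join(p.info["cmdline"] or [])
        mem = p.info["memory_info"]
        if "mpi4py.futures.server" in cmdline and mem is not None:
            total += mem.rss
    return total

# Send requests serially, forever, and log worker RSS after each one.
while True:
    requests.post(URL, json=PAYLOAD, timeout=600)
    rss_mib = mpi_worker_rss_bytes() / 2**20
    print(f"{time.strftime('%H:%M:%S')}  mpi4py worker RSS = {rss_mib:.1f} MiB")
```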

Expected behavior

No memory leak

Actual behavior

Some online services (Qwen2.5-7B / QwQ-32B) have already hit OOM (out-of-memory) errors, causing process crashes.

Additional notes

We already tried the fix from #6901 ("[Fix]: Breaking Change: disable nvtx annotation by default", https://github.com/NVIDIA/TensorRT-LLM/issues/6901). However, it does not seem to have helped.

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

harryjing avatar Nov 21 '25 03:11 harryjing

@harryjing That fix was never merged in the end. Have you tried setting NVTX_DISABLE to 1?

troycheng avatar Nov 21 '25 04:11 troycheng

> @harryjing That fix was never merged in the end. Have you tried setting NVTX_DISABLE to 1?

Is the NVTX_DISABLE flag only effective at wheel build time? If I directly modify the Python file to disable the NVTX annotations instead, will that have the same effect?

harryjing avatar Nov 21 '25 05:11 harryjing

Set the environment variable NVTX_DISABLE to 1 before running the trtllm-serve instance.
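For example (a minimal sketch, not an official recipe), the equivalent of running `NVTX_DISABLE=1 trtllm-serve ...` from a shell, using a trimmed-down version of the command quoted later in this issue; add your remaining flags as needed.

```python
# Sketch: launch trtllm-serve with NVTX_DISABLE=1 in its environment so the
# variable is visible to the worker processes it spawns (the environment is
# normally inherited). The arguments below are a trimmed-down placeholder
# version of the command quoted later in this issue.
import os
import subprocess

env = dict(os.environ, NVTX_DISABLE="1")
subprocess.run(
    [
        "trtllm-serve", "/mnt/modelops/models/Qwen3-8B",
        "--host", "127.0.0.1",
        "--port", "9123",
        "--backend", "pytorch",
        "--tp_size", "4",
    ],
    env=env,
    check=True,  # blocks for the lifetime of the server
)
```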

troycheng avatar Nov 21 '25 05:11 troycheng

> Set the environment variable NVTX_DISABLE to 1 before running the trtllm-serve instance.

I set the environment variable NVTX_DISABLE to 1, but memory usage is still increasing; it went from 3 GB to 6 GB in 3 hours. I then stopped sending requests, and there is no sign of a decrease.

Image

harryjing avatar Nov 21 '25 09:11 harryjing

In our online environment we use versions 1.1.0rc5 and 1.2.0rc0. After setting NVTX_DISABLE=1 to disable NVTX profiling, we no longer observe significant memory leaks. For example, here is the RSS memory of an online service that contains both persistent and dynamically scaled instances:

Image

The root cause of NVTX causing persistent memory growth in Python processes is explained at https://github.com/NVIDIA/TensorRT-LLM/issues/6901. Perhaps you could try double-checking whether the environment variable setting has taken effect or running another profiling analysis to confirm whether it's the same issue.
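One quick way to do that double-check on Linux is to read /proc/<pid>/environ of the running workers. A minimal sketch; the process-name filter matches the mpi4py.futures.server processes from the issue title, and it must run as the same user as the server (or root):

```python
# Check that NVTX_DISABLE=1 actually reached the running worker processes
# by reading /proc/<pid>/environ (Linux only).
import os

def proc_env(pid: str) -> dict:
    """Parse /proc/<pid>/environ into a dict."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        pairs = f.read().split(b"\0")
    return dict(p.decode(errors="replace").split("=", 1) for p in pairs if b"=" in p)

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        if "mpi4py.futures.server" in cmdline:
            print(pid, "NVTX_DISABLE =", proc_env(pid).get("NVTX_DISABLE", "<not set>"))
    except OSError:
        continue  # process exited or is not readable
```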

troycheng avatar Nov 21 '25 12:11 troycheng

> In our online environment we use versions 1.1.0rc5 and 1.2.0rc0. After setting NVTX_DISABLE=1 to disable NVTX profiling, we no longer observe significant memory leaks. For example, here is the RSS memory of an online service that contains both persistent and dynamically scaled instances:
>
> Image
>
> The root cause of NVTX causing persistent memory growth in Python processes is explained at [#6901](https://github.com/NVIDIA/TensorRT-LLM/issues/6901). Perhaps you could try double-checking whether the environment variable setting has taken effect or running another profiling analysis to confirm whether it's the same issue.

I set NVTX_DISABLE=1, as shown below, but memory usage is still increasing. My version is 1.2.0rc1, installed from tensorrt_llm-1.2.0rc1-cp310-cp310-linux_x86_64.whl.

My trtllm-serve command:

`trtllm-serve /mnt/modelops/models/Qwen3-8B --host 127.0.0.1 --port 9123 --kv_cache_free_gpu_memory_fraction 0.7 --trust_remote_code --log_level info --tp_size 4 --ep_size 1 --pp_size 1 --max_batch_size 8 --max_num_tokens 8192 --tokenizer /mnt/modelops/models/Qwen3-8B --backend pytorch --extra_llm_api_options /tmp/extra_llm_api_options.yaml`

My extra_llm_api_options.yaml file:

    allreduce_strategy: AUTO
    attn_backend: TRTLLM
    cuda_graph_config:
      enable_padding: true
      max_batch_size: 8
    disable_overlap_scheduler: false
    dtype: auto
    enable_attention_dp: false
    enable_chunked_prefill: true
    enable_iter_perf_stats: true
    enable_iter_req_stats: true
    kv_cache_config:
      dtype: auto
      enable_block_reuse: true
      free_gpu_memory_fraction: 0.85
    print_iter_log: false
    return_perf_metrics: true
    sampler_type: auto
    scheduler_config:
      context_chunking_policy: FIRST_COME_FIRST_SERVED

RSS continues to grow, but the pattern looks somewhat different from what was observed in #6901.

Image

UPDATE: The memory issue occurs when I enable these two options: enable_iter_perf_stats: true and enable_iter_req_stats: true.
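If the per-iteration statistics are not needed in production, one possible mitigation (it loses those stats and does not fix the underlying growth) is to switch the two options off and restart trtllm-serve. A minimal sketch, assuming PyYAML and the /tmp/extra_llm_api_options.yaml path from the command above:

```python
# Workaround sketch: disable the two iteration-stats options in the
# extra_llm_api_options.yaml used above, then restart trtllm-serve.
# Requires: pip install pyyaml
import yaml

path = "/tmp/extra_llm_api_options.yaml"
with open(path) as f:
    cfg = yaml.safe_load(f)

cfg["enable_iter_perf_stats"] = False
cfg["enable_iter_req_stats"] = False

with open(path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```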

harryjing avatar Nov 24 '25 03:11 harryjing


> UPDATE: The memory issue occurs when I enable these two options: enable_iter_perf_stats: true and enable_iter_req_stats: true.

We found that if you set enable_iter_perf_stats=true, enable_iter_req_stats=true, and tp=4, the stats in the child worker’s py_executor will grow indefinitely, which causes a memory leak.

Image

Maybe we can set a maximum length on this array, like PR #9257 does?
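The direction suggested here, capping the stats list as in PR #9257, amounts to replacing an unbounded list with a bounded buffer. A generic sketch of that pattern (not the actual py_executor code):

```python
# Generic "bounded stats buffer" pattern: a deque with maxlen drops the oldest
# entries once the cap is reached, so memory stays bounded even if the stats
# are never consumed.
from collections import deque

MAX_ITER_STATS = 1000          # hypothetical cap

iter_stats = deque(maxlen=MAX_ITER_STATS)

def record_iteration_stats(stats: dict) -> None:
    # Appending past maxlen silently evicts the oldest element.
    iter_stats.append(stats)

def drain_iteration_stats() -> list:
    # Consumers take a snapshot and clear the buffer, as a stats endpoint would.
    snapshot = list(iter_stats)
    iter_stats.clear()
    return snapshot
```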

zhanghaotong avatar Nov 25 '25 03:11 zhanghaotong