
Python backend SHM memory leak

Open mbahri opened this issue 1 year ago • 24 comments

Description

I am encountering two possibly related issues with the Python backend and shared memory:

  1. During operation, the shared memory usage keeps growing, leading to errors. It looks like the shared memory regions allocated by the Python backend for its inputs are not recycled. I understand the SHM region grows based on the size of the inputs, but this is an issue especially when multiple model instances are running. Also, it is possible the region grows beyond the largest input if memory is leaked instead of re-used.
  2. After the Triton container is terminated, allocated shared memory regions remain in /dev/shm

Triton Information

What version of Triton are you using? 2.47.0

Are you using the Triton container or did you build it yourself? Official containers:

  • 24.06-py3
  • 24.06-trtllm-python-py3

To Reproduce

I encountered the issue with any Python-based model I tried:

  • Python models for image pre-processing
  • Python BLS models
  • TensorRT-LLM models using the Python backend (I don't think I can use the tensorrt_llm backend for my use case - multimodal model)

Expected behavior

  1. Shm regions would be shrunk, or at least wouldn't grow indefinitely (arena-style allocator?)
  2. Shm regions would be de-allocated when the model shuts down

mbahri avatar Jul 27 '24 11:07 mbahri

Hi @mbahri,

Do you have a minimal model, client, and reproduction steps you could share to help expedite debugging? If it is a generic Python backend shm issue, then a simple Python model not doing anything interesting (identity, etc.) may be able to reproduce it.

CC @Tabrizian @kthui @krishung5 for viz

rmccorm4 avatar Jul 31 '24 22:07 rmccorm4
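For anyone building such a repro, a quick way to quantify the growth is to sample the total size of the backend's regions in /dev/shm between batches of identical requests. A minimal sketch - the empty `prefix` default is an assumption; match it to whatever region names your Triton version actually creates:

```python
from pathlib import Path

def shm_usage_bytes(shm_dir="/dev/shm", prefix=""):
    """Sum the sizes of all shm regions whose names match the prefix."""
    return sum(
        f.stat().st_size
        for f in Path(shm_dir).glob(prefix + "*")
        if f.is_file()
    )
```

With recycling, the number sampled between batches should plateau at the size needed for the largest input; with a leak, it climbs monotonically.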

Hi everyone,

@mbahri, has this been solved already? If it has, could you provide an explanation of the solution?

I'm facing a similar problem here. We already filed a GitHub issue describing this problem and a ticket was opened, but it is still occurring and we don't have a solution.

@rmccorm4, there are steps and metrics there that can be used to reproduce and analyse the problem. You can find them here: https://github.com/triton-inference-server/server/issues/6720

rodrigo-orlandini avatar Aug 21 '24 19:08 rodrigo-orlandini

We are facing the same issues in our models. Any more updates on this?

Also, for the second issue, /dev/shm is not cleaned after container restarts. If you are in a k8s environment, we've used a hacky way to clean it once the container restarts, so at least the container won't end up in CrashLoopBackOff because it has no memory available:

                  "lifecycle": {
                     "postStart": {
                        "exec": {
                           "command": ["/bin/sh", "-c", "rm -f /dev/shm/*"]
                        }
                     }
                  },

fangpings avatar Sep 09 '24 23:09 fangpings
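The same cleanup can be done from Python in an entrypoint script before the server starts, without touching unrelated shm files. A sketch - the `triton_python_backend_shm` prefix is an assumption based on the region names reported later in this thread; verify it against your own /dev/shm listing:

```python
from pathlib import Path

def remove_stale_regions(shm_dir="/dev/shm", prefix="triton_python_backend_shm"):
    """Delete leftover Python-backend shm regions from a previous run."""
    removed = []
    for region in Path(shm_dir).glob(prefix + "*"):
        try:
            region.unlink()
            removed.append(region.name)
        except OSError:
            pass  # a live server may still hold the region open
    return removed
```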

Facing a similar issue when deploying on k8s: SHM grows and the pod is killed with OOM.

I do not encounter this when testing without k8s.

ash2703 avatar Oct 01 '24 13:10 ash2703

Hello @Tabrizian @kthui @krishung5, I have also been running into the same issue with SHM memory leak on Triton 24.04. I noticed this only began when I switched my ensemble model to BLS to add more custom branching. As other commenters have noted, /dev/shm/ fills up and has to be manually cleared between container restarts for me to mitigate the memory leak.

I have attached a valgrind log file with more details from some warmup requests: triton_valgrind.log

You can see in this log file a lot of logs that are specific to BLS (see ExecuteBLSRequest) and shared memory (SaveRequestsToSharedMemory). For example:

==63== Use of uninitialised value of size 8
==63==    at 0x606645F: pthread_cond_broadcast@@GLIBC_2.3.2 (pthread_cond_broadcast.c:76)
==63==    by 0x7F17F29: triton::backend::python::ModelInstanceState::ExecuteBLSRequest(std::shared_ptr<triton::backend::python::IPCMessage>, bool) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F1915C: std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<triton::backend::python::ModelInstanceState::ProcessRequests(TRITONBACKEND_Request**, unsigned int, std::vector<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> >, std::allocator<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> > > >&, bool&)::{lambda()#3}, std::allocator<int>, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F22D1C: std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x60674DE: __pthread_once_slow (pthread_once.c:116)
==63==    by 0x7F05378: std::__future_base::_Task_state<triton::backend::python::ModelInstanceState::ProcessRequests(TRITONBACKEND_Request**, unsigned int, std::vector<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> >, std::allocator<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> > > >&, bool&)::{lambda()#3}, std::allocator<int>, void ()>::_M_run() (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F3AED1: boost::asio::detail::executor_op<boost::asio::detail::binder0<std::packaged_task<void ()> >, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F2B91B: boost::asio::detail::scheduler::run(boost::system::error_code&) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F2BE8C: boost::asio::detail::posix_thread::func<boost::asio::thread_pool::thread_function>::run() (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F1F5F3: boost_asio_detail_posix_thread_function (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x605E608: start_thread (pthread_create.c:477)
==63==    by 0x657A352: clone (clone.S:95)
==63==  Uninitialised value was created by a heap allocation
==63==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==63==    by 0x7F16112: triton::backend::python::ModelInstanceState::SaveRequestsToSharedMemory(TRITONBACKEND_Request**, unsigned int, std::vector<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> >, std::allocator<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> > > >&, triton::backend::python::AllocatedSharedMemory<char>&, std::shared_ptr<std::vector<TRITONBACKEND_Response*, std::allocator<TRITONBACKEND_Response*> > >&) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F1B7B5: triton::backend::python::ModelInstanceState::ProcessRequests(TRITONBACKEND_Request**, unsigned int, std::vector<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> >, std::allocator<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> > > >&, bool&) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F1E212: TRITONBACKEND_ModelInstanceExecute (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x5140724: triton::core::TritonModelInstance::Execute(std::vector<TRITONBACKEND_Request*, std::allocator<TRITONBACKEND_Request*> >&) (in /opt/tritonserver/lib/libtritonserver.so)
==63==    by 0x51409DA: triton::core::TritonModelInstance::Schedule(std::vector<std::unique_ptr<triton::core::InferenceRequest, std::default_delete<triton::core::InferenceRequest> >, std::allocator<std::unique_ptr<triton::core::InferenceRequest, std::default_delete<triton::core::InferenceRequest> > > >&&) (in /opt/tritonserver/lib/libtritonserver.so)
==63==    by 0x523D57C: triton::core::Payload::Execute(bool*) (in /opt/tritonserver/lib/libtritonserver.so)
==63==    by 0x514493A: triton::core::TritonModelInstance::TritonBackendThread::BackendThread() (in /opt/tritonserver/lib/libtritonserver.so)
==63==    by 0x615F792: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.32)
==63==    by 0x605E608: start_thread (pthread_create.c:477)
==63==    by 0x657A352: clone (clone.S:95)

This is from a run where I loaded the models retrieval_bls, signal_client, and inference_retrieval_ensemble. My BLS logic is essentially to run signal_client and, if it succeeds, run inference_retrieval_ensemble. If signal_client fails, I return early with an empty response.

I can't provide you a fully end-to-end reproducible example at the moment. But I have included below a simplified version of retrieval_bls's model.py and config.pbtxt in case they help.

  • model.py: https://gist.github.com/lakshbhasin/f276d1253c717ae6111e628a4ad85643
  • config.pbtxt: https://gist.github.com/lakshbhasin/b86aafeaade178e6fc78790c98dfb0bb

I hope the above valgrind logs are sufficient for you to debug this issue. Thanks a lot for your help.

Edit to add: I undid the BLS change and switched back to an ensemble. Instead of using BLS's branching logic to exit early, I just propagated empty data through the ensemble until the last stage. This is less efficient, as it means later stages and queuing apply instead of exiting early. However, it has solved the memory leak. See the graphs below, where we used to see memory usage increase until the process was restarted by the OOM killer. After the change was rolled out, memory usage is mostly flat and there are no more OOM kills. (attachment: memory_leak_fixed)

lakshbhasin avatar Nov 02 '24 22:11 lakshbhasin

Hi @lakshbhasin, thanks for the detailed example! I wonder if any of the raise statements are invoked during the life of the BLS model?

        ...
        if signal_client_response.has_error():
           ...
            # Raise other errors.
            raise pb_utils.TritonModelException(signal_client_response.error().message())
        ...
        if infer_and_retrieve_response.has_error():
            raise pb_utils.TritonModelException(infer_and_retrieve_response.error().message())
        ...

This is because the Python backend may not handle those raised exceptions gracefully enough, which may cause issues later. We recommend returning the error in an error response, i.e.

        ...
        if infer_and_retrieve_response.has_error():
            #raise pb_utils.TritonModelException(infer_and_retrieve_response.error().message())
            return [pb_utils.InferenceResponse(error=pb_utils.TritonError(infer_and_retrieve_response.error().message()))]
        ...

see error-handling for more details.

kthui avatar Nov 12 '24 19:11 kthui

Hi @kthui, thanks for taking a look. Yes, those raise statements are invoked. Has this been confirmed as the cause of the memory leak?

The code I shared is similar to the example tutorial here for BLS. Does this example also need to be updated, if it is prone to a memory leak?

lakshbhasin avatar Nov 13 '24 06:11 lakshbhasin

Yes, those raise statements are invoked. Has this been confirmed as the cause of the memory leak?

This is not confirmed. Would you be able to help us verify the hypothesis by removing the raise statements and seeing if the memory leak persists?

The reason I suspected this might be the cause is that the shared memory is owned by the underlying objects, i.e. the response, and the memory is released automatically when the object goes out of scope. If for some reason the object is not garbage collected, this will cause a memory leak - depending on how long it takes for it to be collected.

kthui avatar Nov 13 '24 18:11 kthui
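The lifetime argument above is easy to demonstrate outside Triton: a stored exception keeps its traceback, and the traceback keeps every frame it passed through (and that frame's locals, including any response-like object) alive. A minimal sketch using a stand-in object instead of a real shm-backed response:

```python
import weakref

class FakeResponse:
    """Stand-in for an object that owns a shared-memory region."""

refs = []

def infer(fail):
    response = FakeResponse()
    refs.append(weakref.ref(response))
    if fail:
        raise RuntimeError("inference failed")

# Normal return: the response is freed as soon as infer() returns.
infer(False)
assert refs[0]() is None

# Raised and stored exception: its traceback references infer()'s
# frame, which still holds 'response', so the object stays alive.
try:
    infer(True)
except RuntimeError as exc:
    held = exc
assert refs[1]() is not None

# Dropping the exception finally releases the object.
held = None
assert refs[1]() is None
```

If a layer above the model stores the exception (a log aggregator, a retry wrapper, a request cache), the shm-owning object would be released only when that reference is dropped.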

This is not confirmed. Would you be able to help us verify the hypothesis by removing the raise statements and seeing if the memory leak persists?

@lakshbhasin, could you please confirm if you got a chance to verify the above? Thank you.

pskiran1 avatar Nov 21 '24 09:11 pskiran1

Sorry to butt in without too much context, but I'd like to mention that shm memory is never gracefully collected when a process is terminated. Even in pure C, one has to register signal handlers for SIGINT, SIGTERM, and SIGQUIT, and even for SIGILL, SIGFPE, SIGSEGV, and SIGBUS, in order to close the opened shared memory blocks.

nicolasnoble avatar Nov 21 '24 17:11 nicolasnoble
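The same idea can be sketched in Python with multiprocessing.shared_memory: the /dev/shm entry outlives the process unless something calls unlink(), so termination signals have to be routed through a cleanup path. This is only an illustration of the principle, not how the Triton backend manages its regions:

```python
import signal
import sys
from multiprocessing import shared_memory

def create_region(size=1 << 20):
    """Create an shm region and wire termination signals to unlink it."""
    shm = shared_memory.SharedMemory(create=True, size=size)

    def cleanup(signum=None, frame=None):
        shm.close()
        try:
            shm.unlink()  # removes the /dev/shm entry
        except FileNotFoundError:
            pass          # already unlinked
        if signum is not None:
            sys.exit(128 + signum)

    # SIGKILL cannot be caught: a hard kill still leaves the file behind,
    # which matches the leftover regions people see after container crashes.
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGQUIT):
        signal.signal(sig, cleanup)
    return shm, cleanup
```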

This is not confirmed. Would you be able to help us verify the hypothesis by removing the raise statements and seeing if the memory leak persists?

Hi @pskiran1 @kthui: unfortunately, I don't think I'll have time to try this out. We had to refactor a lot of the code to move away from BLS, and have already built on top of that new code for multiple services. It would have to be a business priority for me to switch back to and re-evaluate BLS, which I don't see happening in the next few months.

If you all do think this can cause shared memory leaks, you may at least want to edit the tutorial docs for now to replace the raise logic? (This is just one example – there are other files with the same logic in the BLS tutorials.)

lakshbhasin avatar Nov 25 '24 20:11 lakshbhasin

@lakshbhasin, @mbahri, @rodrigo-orlandini, @fangpings, we tried to reproduce the issue using the provided specifications, including multiple instances and the example BLS model (BLS or other model raises errors), while sending numerous requests both synchronously and asynchronously. However, we were unable to reproduce the problem. Could you please provide some additional inputs or minimal issue repro models for more effective debugging?

pskiran1 avatar Nov 28 '24 13:11 pskiran1

Hi, I'm experiencing the same issue on one of my servers, which operates on k8s. I have a Python BLS model that orchestrates several other models, including a Python video downloader and decoder, three Python wrappers for PyTorch/TensorFlow models, a preprocessor, and a custom backend model in C++ based on the Vosk library. The shared memory continues to increase and isn't being freed up throughout the lifecycle. At times, this causes my pod to get killed due to out-of-memory (OOM) errors, and only recreating the pod helps. Internally, I'm using DLPack, and I'm planning to remove this to see if it solves the issue.

gerasim13 avatar Dec 04 '24 21:12 gerasim13

I am also dealing with exceptions raised in my dependent models, as in this particular scenario (one chunk with an exception):

audio = audio.reshape([1, len(audio)])
transcribed = np.full(
    shape=[1, 1],
    fill_value=transcribed.encode('utf-8'),
    dtype=object
)
infer_inputs = [
    pb_utils.Tensor('phrases', transcribed),
    pb_utils.Tensor('audio', audio),
]
infer_request = pb_utils.InferenceRequest(
    request_id=asset_uid,
    inputs=infer_inputs,
    requested_output_names=['result'],
    model_name=self.diarization_model_name,
    model_version=self.diarization_model_version,
    timeout=self.diarization_model_timeout,
)

if request.is_cancelled():
    return None

r = infer_request.exec(decoupled=False)
if r.has_error():
    raise pb_utils.TritonModelException(
        r.error().message()
    )

gerasim13 avatar Dec 04 '24 22:12 gerasim13

I'm using DLPack, and I'm planning to remove this to see if it solves the issue.

@gerasim13, did removing DLPack or the raise statements (as mentioned in the earlier comments) help resolve the issue? As requested earlier, it would be greatly appreciated if you could share a sample repro. Thank you.

pskiran1 avatar Dec 20 '24 10:12 pskiran1

No, it didn't help. The memory is not freed after passing binary data as tensors (like video, audio, and images), and eventually the server stops responding, entering an infinite reboot loop because of OOM. I have to delete the pod and restart it over and over again. As a quick fix, I came up with a workaround: write this data to the /tmp folder as files, and then pass their paths from my BLS to the other dependent models as tensors.

gerasim13 avatar Dec 24 '24 15:12 gerasim13
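The spill-to-disk workaround can be sketched generically: write the large payload to a file and pass only the short path string through Triton, then have the dependent model read and delete it. The helper names here are hypothetical, not Triton API:

```python
import os
import tempfile

def stash_payload(data: bytes, tmp_dir="/tmp") -> str:
    """Write a large binary payload to disk; only the short path
    string then needs to travel through shared memory as a tensor."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path

def consume_payload(path: str) -> bytes:
    """Read the payload in the dependent model and clean up the file."""
    with open(path, "rb") as f:
        data = f.read()
    os.remove(path)  # the producer no longer needs the file
    return data
```

The trade-off is extra disk I/O per request and the need to clean up /tmp if a consumer dies before reading its file.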

After adjusting the SHM volume settings using this guide https://www.and-fs.de/posts/kubernetes/k8s-shmsize/, the issue was resolved.

gerasim13 avatar Dec 31 '24 11:12 gerasim13

No, it seems that the problem is still present. After a few days of operation, the server began producing errors like this:

E0102 20:59:02.073741 1 backendmodel.cc:692] "ERROR: Failed to create instance: Unable to initialize shared memory key 'tritonpythonbackendshmregion74685718-b0c1-4470-b9ae-f1eca657c270' to requested size (536870912 bytes). If you are running Triton inside docker, use '--shm-size' flag to control the shared memory region size. Each Python backend model instance requires at least 1 MB of shared memory. Error: No space left on device"

In the /dev/shm folder, I see a bunch of files that the server did not delete, although according to the logs it was restarted more than once.

gerasim13 avatar Jan 02 '25 21:01 gerasim13

@gerasim13, if possible could you please help us with the issue repro to try from our end? Thank you.

pskiran1 avatar Jan 19 '25 12:01 pskiran1

@gerasim13, if possible could you please help us with the issue repro to try from our end? Thank you.

I hope this helps: https://github.com/gerasim13/triton-server-shm-bug-investigation

gerasim13 avatar Apr 10 '25 16:04 gerasim13

@gerasim13, thank you. I can reproduce the issue.

pskiran1 avatar Apr 15 '25 11:04 pskiran1

Any update on this? I had the same problem

hoangphuc1998 avatar Apr 27 '25 15:04 hoangphuc1998

@pskiran1 We keep increasing the shm size, but it always ends up crashing after a few days. Do you know if something can be done?

gaetansnl avatar Jun 04 '25 20:06 gaetansnl

@gaetansnl, @hoangphuc1998, we will prioritize this issue soon. As a workaround to unblock you, as mentioned in the comment here, please return the exception gracefully instead of raising it.

pskiran1 avatar Jun 05 '25 09:06 pskiran1

Any updates here? We are having the same issue as well: every new request piles up in shared memory.

ChristosCh00 avatar Jul 28 '25 17:07 ChristosCh00

@ChristosCh00, could you please try the workaround https://github.com/triton-inference-server/server/issues/7481#issuecomment-2943384163?

pskiran1 avatar Jul 28 '25 17:07 pskiran1

We are experiencing the same issue on 24.06. Wrapping exceptions in InferenceResponse didn't help. One thing we noticed on our setup that may help: the model can't unload properly. A request to /index returns a perpetual UNLOADING state, the corresponding processes are not shut down, and memory is not freed. When we send a ${MODEL_NAME}/load request, the model's state in /index changes to READY, but the unloading instances keep hanging.

grk717 avatar Aug 08 '25 07:08 grk717

We are experiencing the same issue on 24.06. Wrapping exceptions in InferenceResponse didn't help. One thing we noticed on our setup that may help: the model can't unload properly. A request to /index returns a perpetual UNLOADING state, the corresponding processes are not shut down, and memory is not freed. When we send a ${MODEL_NAME}/load request, the model's state in /index changes to READY, but the unloading instances keep hanging.

@grk717, this seems to be a new issue. Please try the latest version 25.07. If you continue to experience the same issue, please share the minimal steps/resources to reproduce. Thank you.

pskiran1 avatar Aug 11 '25 05:08 pskiran1