GPU memory is not released by Triton

briedel opened this issue 1 year ago • 19 comments

Description

Triton does not clear or release GPU memory when there is a pause in inference. The attached diagrams show the same model, served via the ONNX Runtime backend.

image (1)

image (2)

The GPU memory keeps growing when I do not use the TensorRT optimization to limit the workspace memory. With the TensorRT optimization the memory is limited, but not to the configured value: when I limit the workspace to 7 GB, 11 GB of GPU memory are used, though that usage stays constant.
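The TensorRT optimization I am referring to is the workspace limit applied through the ONNX Runtime backend's execution-accelerator settings; roughly this stanza (the 7 GB value is the limit mentioned above, the precision mode is only illustrative):

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key: "max_workspace_size_bytes" value: "7516192768" }  # 7 GB workspace limit
        parameters { key: "precision_mode" value: "FP16" }                  # illustrative only
      }
    ]
  }
}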

My model config is:

name: "model_name"
platform: "onnxruntime_onnx"
backend: "onnxruntime"
default_model_filename: "model.onnx"
version_policy: {
      all: {}
}
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
max_batch_size: 1000

dynamic_batching { }

Triton Information

24.06-py3

Are you using the Triton container or did you build it yourself?

Container


briedel avatar Sep 04 '24 16:09 briedel

Facing the same issue with 24.01 and 24.08

susnato avatar Sep 11 '24 15:09 susnato

Facing the same issue with 24.01 and 24.08

Try using jemalloc instead of the default malloc.

jemalloc returns freed memory to the system much more quickly.

Goga1992 avatar Sep 24 '24 03:09 Goga1992

Got the same problem when using onnxruntime

anass-al avatar Sep 26 '24 15:09 anass-al

Same here, is there any news?

I'm using the ONNX backend with a BGE-M3 ONNX model. It fills up the GPU memory and never releases it. image

CharlesOural avatar Nov 11 '24 18:11 CharlesOural

Hi, can you please add these lines at the end of the model's config.pbtxt, restart the server, and see whether it helps?

parameters { key: "enable_mem_arena" value: { string_value: "1" } }
parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0" } }

I might be wrong about the exact syntax, but basically you need to turn on shrinkage for the CUDA memory arena.

Please let me know if this works; otherwise I will test it myself tomorrow and let you know the syntax that worked for me in the past.

susnato avatar Nov 11 '24 19:11 susnato

Let me try rn!

CharlesOural avatar Nov 11 '24 20:11 CharlesOural

OK, this works amazingly well! Thank you so much.

Quick questions :

  • As I understand it, this was not a bug; the ORT engine just keeps the memory allocated and reuses it internally. Is it better to keep it that way, or is it better with your params turned on?

  • I have a config with an ensemble model:

name: "ensemble_bge-m3"
# max_batch_size: 8  # maximum batch size
platform: "ensemble"  # ensemble model architecture is used to chain tokenization with model inference

#input to the ensemble model
input [
    {
        name: "TEXT"
        data_type: TYPE_STRING
        dims: [ -1 ]
        # -1 means dynamic seq length, aka this dimension may change at runtime
    }
]

#output of the ensemble model
output [
    # these values for the outputs must match those in the output mapping of the last model in the ensemble pipeline
    {
        name: "dense_embeddings"
        data_type: TYPE_FP16
        dims: [ -1, 1024 ]
        # the first dimension is the batch size
    },
    {
        name: "colbert_embeddings"
        data_type: TYPE_FP16
        dims: [ -1, -1, 1024 ]
        # the first two dimensions are the batch size and the number of tokens
    }
]

#Type of scheduler to be used
ensemble_scheduling {
    step [  # each step corresponds to a step in the ensemble pipeline
        # Tokenization step
        {
            model_name: "tokenizer_bge-m3"  # must match with the model_name defined in the tokenizer config
            model_version: -1 # -1 means use latest version
            input_map {
                key: "TEXT"  # key is the name of the tokenizer model's input
                value: "TEXT" # value is the ensemble tensor that feeds it (here, the ensemble's own TEXT input)
            }
            output_map [
                {
                    key: "input_ids"  # key is the name of the tensor output of the tokenizer
                    value: "input_ids" # value is the name of the tensor output we are setting for the tokenizer
                },
                {
                    key: "attention_mask"
                    value: "attention_mask"
                }
            ]
        },
        # Model Inference step
        {
            model_name: "bge-m3" # must match with the model_name defined in the bge-m3 model config
            model_version: -1 # -1 means use latest version
            input_map [
                # these input maps map each input of this model (key) to the ensemble tensor produced by the previous step (value)
                {
                    key: "input_ids"  # key is the name of the input of this model
                    value: "input_ids" # value is the ensemble tensor coming from the tokenizer step
                },
                {
                    key: "attention_mask"
                    value: "attention_mask"
                }
            ]
            output_map [
                # these output maps map each output of this model (key) to the ensemble output tensor it produces (value)
                {
                    key: "sentence_embedding" # key is the name of the output of the current model
                    value: "dense_embeddings" # value is the name of the output of the current step
                },
                {
                    key: "token_embeddings" # key is the name of the output of the current model
                    value: "colbert_embeddings" # value is the name of the output of the current step
                }
            ]
        }
    ]
}

and

name: "bge-m3"
platform: "onnxruntime_onnx"
backend: "onnxruntime"
default_model_filename: "model_fp16.onnx"

# if tensorrt engine is used instead of onnx, the following platform, backend and default_model_filename should be used instead
# platform: "tensorrt_plan"
# backend: "tensorrt"
# default_model_filename: "model.plan"


max_batch_size: 12
input [
  {
    name: "input_ids"  # must match the ensemble model's input_map
    data_type: TYPE_INT64  # int64 is the type required by the ONNX model
    dims: [ -1 ]  # -1 indicates that seq_len is dynamic; note that we do not specify the batch dimension
  },
  {
    name: "attention_mask" # must match the ensemble model's input_map
    data_type: TYPE_INT64 # int64 is the type required by the ONNX model
    dims: [ -1 ] # -1 indicates that seq_len is dynamic; note that we do not specify the batch dimension
  }
]
output [
    {
        name: "sentence_embedding"  # must match with the names of the output tensors defined in the onnx model file
        data_type: TYPE_FP16 # must match with the data type of the output tensor defined in the onnx model file
        dims: [ 1024 ] # must match with the dimension of the output tensor defined in the onnx model file
    },
    {
        name: "token_embeddings" # colbert vector output from BGE-M3 model
        data_type: TYPE_FP16
        dims: [ -1, 1024 ] # colbert will generate a vector per token with 1024 dimensions each for each batch element
    }
]

instance_group [
    {
      count: 1 # number of instances of the model to run on the Triton inference server
      kind: KIND_GPU  # or KIND_CPU, corresponds to whether model is to be run on GPU or CPU
    }
]
dynamic_batching {
   max_queue_delay_microseconds: 10000
}

parameters { key: "enable_mem_arena" value: { string_value: "1" } }
parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0" } }
name: "tokenizer_bge-m3"
backend: "python"  # set to python because we are loading and pre-processing the input for tokenizers in python

max_batch_size: 12  # must match the max_batch_size of the ensemble model config
input [
# the input we specify here must match the input specified for the ensemble model config
{
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ -1 ]
}
]

output [
# the outputs we specify here must match the outputs specified in the input-output mapping of the ensemble model config
{
    name: "input_ids"
    data_type: TYPE_INT64 # this is the datatype required by the onnx model
    dims: [ -1 ]  # -1 means the seq_len can be variable. Total number of dimensions must be 1 (we do not specify the batch
    # dimension, in order to leverage Triton server's dynamic batching)
},
{
    name: "attention_mask"
    data_type: TYPE_INT64  # this is the datatype required by the onnx model
    dims: [ -1 ]  # -1 means the seq_len can be variable. Total number of dimensions must be 1 (we do not specify the batch
    # dimension, in order to leverage Triton server's dynamic batching)
}
]

instance_group [
    {
      count: 1 # number of instances of tokenizer
      kind: KIND_GPU  # or KIND_CPU, corresponding to whether the tokenizer runs on the GPU or the CPU
    }
]

Am I doing something wrong with batching that could lead to such memory overload? I tried sending a batch of 8 and it crashed, yet if I go up to 9 and then back to 8 it works...

CharlesOural avatar Nov 11 '24 20:11 CharlesOural

As I understand it, this was not a bug; the ORT engine just keeps the memory allocated and reuses it internally. Is it better to keep it that way, or is it better with your params turned on?

Yes, "it's a feature not a bug" 😅

I depends on what you are trying to do, if I am in a production cluster, I already know that I am going to use the nodes "only" for deploying my model, so I will keep the settings as it is.

If I am on an non-production machine where I need to constantly deploy more than one model at a time, then it would be "very" bad if a single model takes up that much amount of cuda memory and never release it! I would probably use the config in that case.

Am I doing something wrong with batching that could lead to such memory overload? I tried sending a batch of 8 and it crashed, yet if I go up to 9 and then back to 8 it works...

Yes, it's expected. Here is the reason, according to my understanding:

Suppose you need to infer a batch of 8 and that requires 4 GB of CUDA memory, so ORT allocates that much at that point. Now suppose we did not turn on shrinkage, so it holds on to that memory and does NOT release it. Next you need to infer a batch of 9, which requires 4.2 GB of memory; ORT will then allocate an "additional" 4.2 GB of CUDA memory (even though 4 GB is already allocated, it will not reuse it).

But if you have already inferred a batch of 9, i.e. already allocated 4.2 GB, and after that you pass a batch of 8, which requires less than what was previously allocated, then it will "reuse" the already allocated amount.

That's why you are getting this error. I'm not sure exactly why it behaves this way, but it's probably for better and more efficient CUDA memory allocation.

susnato avatar Nov 12 '24 04:11 susnato

I took a quick look over your config files, and since you are using the default settings, you are actually allocating more memory than you need (the arena grows in powers of two). Can you also add this line to the end of the config file and restart the server?

parameters { key: "arena_extend_strategy" value: { string_value: "1" } }

This should drastically reduce the amount of memory allocated for each of your infer calls, and you should be able to use a batch size of 50+ (considering you had 8 as the max before).
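For reference, putting together the ONNX Runtime parameters discussed so far, the end of your bge-m3 config.pbtxt would look something like this (keep the shrinkage line only if you prefer releasing memory between requests over lower latency, as discussed below):

parameters { key: "enable_mem_arena" value: { string_value: "1" } }
parameters { key: "arena_extend_strategy" value: { string_value: "1" } }  # kSameAsRequested instead of the power-of-two default
parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0" } }  # optional: release arena memory after each run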

susnato avatar Nov 12 '24 04:11 susnato

Hello !

Just tried it out! It works perfectly; I can't go to a batch size of 80, but I went from 8 to 24!

If I understand correctly, it will allocate exactly what is requested rather than the next power of two. How did you find out about this param? Are there other params like this one that I need to know about? Is it OK to use it in a prod environment?

We're in a dev environment for now but going to production next week; I'll change the two memory-arena params then. If I understood your explanation correctly, without them the memory won't be freed, but on a new batch ORT will internally decide to overwrite the previous batch's memory using its own optimized algorithm.

Anyway, thank you so much for your help, it means a lot to me!

CharlesOural avatar Nov 12 '24 10:11 CharlesOural

Hi, I am glad that it worked!

The way I would configure my prod environment (considering the max batch size for your GPU is 24):

  1. Change only the memory allocation strategy - parameters { key: "arena_extend_strategy" value: { string_value: "1" } } - and keep the other configs at their defaults.
  2. Pass a payload with the max batch size of 24 right after the server is started (see the warmup sketch below for one way to automate this).
  3. Restrict the client side so that it can never post a payload of more than 24 examples in one batch.

I would keep memory shrinkage off, because separate memory allocation for each request would increase the latency of the service. Since I send a payload of the max batch size right after the server starts, it allocates the maximum possible CUDA memory up front, and every later request (which is at most the max batch) reuses that already allocated memory, reducing the latency from that point on. And since the client side is restricted, it will never send a payload of more than 24 examples, so it will never throw an OOM error.
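One way to automate step 2 is Triton's model_warmup stanza in the bge-m3 config.pbtxt, which sends the large request once at model load time. This is only a sketch under assumptions: max_batch_size in that config would have to be raised to 24, and 512 is a hypothetical maximum sequence length; zero_data keeps the warmup token ids inside the vocabulary:

model_warmup [
  {
    name: "max_batch_warmup"
    batch_size: 24           # assumes max_batch_size in this config has been raised to 24
    inputs {
      key: "input_ids"
      value: {
        data_type: TYPE_INT64
        dims: [ 512 ]        # hypothetical maximum sequence length; use your real worst case
        zero_data: true      # all-zero token ids are always within vocabulary range
      }
    }
    inputs {
      key: "attention_mask"
      value: {
        data_type: TYPE_INT64
        dims: [ 512 ]
        zero_data: true
      }
    }
  }
]

The warmup output is thrown away; its only purpose is to make ORT grow the arena to its worst-case size before real traffic arrives, exactly like the manual first request described above.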

If I understood your explanation correctly, without them the memory won't be freed, but on a new batch ORT will internally decide to overwrite the previous batch's memory using its own optimized algorithm.

Only if the new batch is less than or equal to the previous batch size.

How did you find out about this param? Are there other params like this one that I need to know about? Is it OK to use it in a prod environment?

Yes, it's OK to use it in a production environment; you can find more about the params here

susnato avatar Nov 12 '24 11:11 susnato

Thank you so much! Wish you a very nice week.

CharlesOural avatar Nov 13 '24 15:11 CharlesOural

Does this issue happen only when using onnxruntime, or is it a Triton thing? I get the same issue for pytorch_libtorch.

AymanAtiia avatar Nov 27 '24 10:11 AymanAtiia

Hi @AymanAtiia, it's an onnxruntime thing (as far as I am aware). Maybe there are similar parameters that you can tune for pytorch_libtorch to avoid this issue.

susnato avatar Nov 27 '24 10:11 susnato

@susnato Yes. I found the parameter. thank you.

parameters: { key: "ENABLE_CACHE_CLEANING" value: { string_value:"true" } }

AymanAtiia avatar Nov 27 '24 11:11 AymanAtiia

@AymanAtiia @susnato But do those work with PyTorch or ONNX models that are exported to a TensorRT plan?

predondo99 avatar Dec 16 '24 16:12 predondo99

I have an issue with memory allocation/deallocation with this model in the Triton server:

name: "decoder"
backend: "onnxruntime"
max_batch_size: 0

input [
  { name: "ASR", data_type: TYPE_FP32, dims: [ -1, 512, -1 ] },
  { name: "F0_PRED", data_type: TYPE_FP32, dims: [ -1, -1 ] },
  { name: "N_PRED", data_type: TYPE_FP32, dims: [ -1, -1 ] },
  { name: "REF", data_type: TYPE_FP32, dims: [ -1, 128 ] }
]

output [
  { name: "AUDIO_OUT", data_type: TYPE_FP32, dims: [ -1, 1, -1 ] }
]

instance_group [ { kind: KIND_GPU, count: 1 } ]

Because of this, the latency increases for bigger inputs. Ideally I would allocate the maximum memory possible (or specify 4 GB / 8 GB) up front and use it for every inference.

The inputs and outputs have dynamic axes, so that could be a problem. Is there a way to solve this with onnxruntime parameters?

Amarnath1906 avatar Nov 25 '25 07:11 Amarnath1906

The inputs and outputs have dynamic axes

Hey, as far as I remember you can use this arg: parameters { key: "enable_mem_arena" value: { string_value: "1" } }. During the startup phase of the server, pass the inputs that would need the largest VRAM possible; that way it will allocate the maximum VRAM needed for any input and reuse it for the following inference runs (without any extra allocation/deallocation), which will save you some latency. Also, don't forget to add a cap on the size of later inputs, otherwise it will try to allocate again and crash the server with a GPU OOM.
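One way to fire that largest-shape request automatically at load time is a model_warmup stanza in the decoder's config.pbtxt. This is only a sketch: it assumes batch_size: 1 is accepted for a non-batching model, and the maximum lengths for the dynamic axes below are purely hypothetical placeholders to replace with your real worst case:

model_warmup [
  {
    name: "largest_shape_warmup"
    batch_size: 1            # the decoder has max_batch_size: 0, so the full shapes go in dims
    inputs {
      key: "ASR"
      value: { data_type: TYPE_FP32  dims: [ 1, 512, 800 ]  zero_data: true }  # 800 frames is a hypothetical maximum
    }
    inputs {
      key: "F0_PRED"
      value: { data_type: TYPE_FP32  dims: [ 1, 800 ]  zero_data: true }
    }
    inputs {
      key: "N_PRED"
      value: { data_type: TYPE_FP32  dims: [ 1, 800 ]  zero_data: true }
    }
    inputs {
      key: "REF"
      value: { data_type: TYPE_FP32  dims: [ 1, 128 ]  zero_data: true }
    }
  }
]

Combined with the arena parameter above and a cap on the size of later inputs, subsequent requests should reuse the arena grown by this warmup instead of re-allocating.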

susnato avatar Nov 25 '25 08:11 susnato

Thank you! Will try this.

Amarnath1906 avatar Nov 25 '25 08:11 Amarnath1906