GPU memory leak when loading/unloading models
Description
When cycling through the load model -> infer -> unload model scenario, we observe a GPU memory leak.
This only happens when the models are in TorchScript format; there is no leak if the same models are converted to ONNX. Memory usage is also stable when no inference is requested (i.e. only cycling through loading and unloading models).
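For reference, one way to produce an equivalent ONNX model for the comparison looks roughly like the sketch below (the paths and opset version are illustrative, not necessarily the exact export settings we used):

import os
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

ONNX_MODEL_PATH = '/ml_serving/models/onnx/1/1/model.onnx'  # illustrative path
os.makedirs(os.path.dirname(ONNX_MODEL_PATH), exist_ok=True)

model = RobertaForSequenceClassification.from_pretrained('roberta-base', torchscript=True)
model.eval()
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
toks = tokenizer(['Hello world!'], return_tensors='pt')

# Export with dynamic batch and sequence dimensions so the Triton config
# can keep dims [-1, -1] for both inputs.
torch.onnx.export(
    model,
    (toks['input_ids'], toks['attention_mask']),
    ONNX_MODEL_PATH,
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={
        'input_ids': {0: 'batch', 1: 'sequence'},
        'attention_mask': {0: 'batch', 1: 'sequence'},
        'logits': {0: 'batch'},
    },
    opset_version=14,  # illustrative; any recent opset should work
)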
Triton Information
Are you using the Triton container or did you build it yourself?
Tested with NVIDIA's tritonserver:23.01-py3 and tritonserver:23.04-py3 Docker images.
To Reproduce
Start the Triton server with the --model-control-mode=explicit flag:
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m -v $PWD:/models nvcr.io/nvidia/tritonserver:23.04-py3 bash -c "tritonserver --model-repository=/models --model-control-mode=explicit --disable-auto-complete-config"
Run a script that loads the model, runs inference, and unloads it a few dozen times:
for i in $(seq 1 100); do curl -XPOST http://127.0.0.1:8000/v2/repository/models/1/load; curl -XPOST 127.0.0.1:8000/v2/models/1/infer -H 'Content-Type: application/json' -d @/ml_serving/v2_input.json; curl -XPOST http://127.0.0.1:8000/v2/repository/models/1/unload; echo $i; done
Every ~50 cycles we lose about 1 GB of GPU memory.
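For convenience, here is a rough Python equivalent of the shell loop (a sketch assuming the tritonclient[http] and pynvml pip packages, which are not part of the original repro); it also samples GPU memory after each unload so the growth is visible without watching nvidia-smi:

import time

import numpy as np
import pynvml
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url='127.0.0.1:8000')
pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

# Same tensor shapes as in v2_input.json: batch of 10, sequence length 17.
input_ids = np.ones((10, 17), dtype=np.int64)
attention_mask = np.ones((10, 17), dtype=np.int64)

for i in range(100):
    client.load_model('1')

    inputs = [
        httpclient.InferInput('input_ids', list(input_ids.shape), 'INT64'),
        httpclient.InferInput('attention_mask', list(attention_mask.shape), 'INT64'),
    ]
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(attention_mask)
    client.infer('1', inputs)

    client.unload_model('1')
    time.sleep(1)  # unloading is asynchronous; give it a moment to finish

    used_mib = pynvml.nvmlDeviceGetMemoryInfo(gpu).used / 2**20
    print(f'cycle {i}: {used_mib:.0f} MiB of GPU memory in use')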
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
We use a standard RoBERTa model for a classification task.
config.pbtxt:
name: "1"
platform: "pytorch_libtorch"
default_model_filename: "model.pt"
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [-1, -1]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [-1, -1]
  }
]
output {
  name: "logits"
  data_type: TYPE_FP32
  dims: [-1, 1]
}
To create model.pt, we use the following script:
import os
import torch
from transformers import RobertaForSequenceClassification
from transformers import RobertaTokenizerFast

TS_MODEL_PATH = '/ml_serving/models/ts/1/1/model.pt'
os.makedirs(os.path.dirname(TS_MODEL_PATH), exist_ok=True)

class pyTorchToTorchScript(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = RobertaForSequenceClassification.from_pretrained('roberta-base', torchscript=True)

    def forward(self, *arg, **kwargs):
        x = self.model(*arg, **kwargs)
        return x[0]

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = pyTorchToTorchScript()
sentences = ['Hello world!', 'Another simple sentence.']
model.eval()
toks = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=200)
output = model(toks['input_ids'], toks['attention_mask'])
print('original:', output)
ts_model = torch.jit.trace(model, (toks['input_ids'], toks['attention_mask']), strict=False)
print('torch:', ts_model(toks['input_ids'], toks['attention_mask']))
ts_model.save(TS_MODEL_PATH)
v2_input.json:
{"inputs":[
{"id":10,"name":"input_ids","shape":[10,17],"datatype":"INT64","data":[[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1]]},
{"id":11,"name":"attention_mask","shape":[10,17],"datatype":"INT64","data":[[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0]]}]
}
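For reference, a payload with this structure can be generated from the tokenizer; a small sketch is below (the sentences and output file name are illustrative, not the exact text behind the token IDs above):

import json
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
# Any batch of sentences padded to a common length works; these are illustrative.
sentences = ['Hello world!', 'Another simple sentence.'] * 5

toks = tokenizer(sentences, return_tensors='np', padding=True)
payload = {
    'inputs': [
        {'name': 'input_ids', 'shape': list(toks['input_ids'].shape),
         'datatype': 'INT64', 'data': toks['input_ids'].tolist()},
        {'name': 'attention_mask', 'shape': list(toks['attention_mask'].shape),
         'datatype': 'INT64', 'data': toks['attention_mask'].tolist()},
    ]
}
with open('v2_input.json', 'w') as f:
    json.dump(payload, f)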
Expected behavior
The Triton server should release all memory that was allocated for a model after it is unloaded. In the long run, memory utilization is expected to be stable.
Thank you for the detailed bug report. We've filed a ticket to investigate.
Hey guys! Any progress on the issue?
Hi, I am also facing the issue and looking for a solution.
Thank you for letting us know. This is still in our queue. We'll investigate soon.
I am encountering this as well.
Thank you for letting us know, Thor.
As an update, we are able to reproduce this on our end as well and have been actively working on it. The issue was introduced in 22.12 after some changes in upstream PyTorch that month. It should have been caught by our testing at the time but was not; we have since fixed the related tests. We are working with the PyTorch folks to provide a fix as soon as we can.
Facing the same issue. Any updates?
Hello, is there a solution or update?
Not yet. We are working on a reproducer running within PyTorch standalone to try to identify the source of the memory growth.
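For anyone who wants to experiment in the meantime, the rough shape of such a reproducer is sketched below (not our internal test; the path and shapes follow the original report): load the traced model onto the GPU, run one inference, delete it, empty the CUDA cache, and check whether device memory returns to the baseline.

import pynvml
import torch

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mib():
    # Device-wide memory in use, as reported by NVML (same source as nvidia-smi).
    return pynvml.nvmlDeviceGetMemoryInfo(gpu).used / 2**20

MODEL_PATH = '/ml_serving/models/ts/1/1/model.pt'  # the traced model from the report
input_ids = torch.ones(10, 17, dtype=torch.int64, device='cuda')
attention_mask = torch.ones(10, 17, dtype=torch.int64, device='cuda')

baseline = used_mib()
for i in range(100):
    model = torch.jit.load(MODEL_PATH, map_location='cuda')
    with torch.no_grad():
        model(input_ids, attention_mask)
    del model
    torch.cuda.synchronize()
    torch.cuda.empty_cache()
    print(f'cycle {i}: {used_mib() - baseline:+.0f} MiB vs baseline')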
Hello, I'm facing the same issue. Any updates?
Not yet. We do not yet have a reproducer isolated to PyTorch or a root cause identified on the Triton side.
Ref: DLIS-4941. CC: @krishung5