
GPU memory leak when loading/unloading models

Open igrinis opened this issue 1 year ago • 11 comments

Description When cycling through a load model -> infer -> unload model scenario, we observe a GPU memory leak.

This only happens when the models are in TorchScript format. There is no leak if the same models are converted to ONNX format. Everything is also fine when no inference is requested (i.e., only cycling through loading and unloading models).

Triton Information Are you using the Triton container or did you build it yourself? Tested with NVIDIA's tritonserver:23.01-py3 and tritonserver:23.04-py3 Docker images.

To Reproduce Start the Triton server with the --model-control-mode=explicit flag:

docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m -v $PWD:/models nvcr.io/nvidia/tritonserver:23.04-py3 bash -c "tritonserver --model-repository=/models --model-control-mode=explicit --disable-auto-complete-config"

Run a script that loads the model, runs inference, and unloads it a few dozen times:

for i in $(seq 1 100); do
  curl -XPOST http://127.0.0.1:8000/v2/repository/models/1/load
  curl -XPOST 127.0.0.1:8000/v2/models/1/infer -H 'Content-Type: application/json' -d @/ml_serving/v2_input.json
  curl -XPOST http://127.0.0.1:8000/v2/repository/models/1/unload
  echo $i
done

Every ~50 cycles we lose about 1 GB of GPU memory.
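
One simple way to watch the growth while the loop runs is to poll nvidia-smi from a side script. A minimal sketch (assumes a single GPU and that nvidia-smi is on PATH):

# Illustrative helper to watch GPU memory while the curl loop above runs.
# Polls nvidia-smi once per second and prints the used memory in MiB.
import subprocess
import time

while True:
    out = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.used',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True, check=True,
    )
    print(time.strftime('%H:%M:%S'), out.stdout.strip(), 'MiB')
    time.sleep(1)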

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

We use a standard RoBERTa model for a classification task.

config.pbtxt:

name: "1"
platform: "pytorch_libtorch"
default_model_filename: "model.pt"

input [
{
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [-1, -1]
},
{
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [-1, -1]
}
]

output {
    name: "logits"
    data_type: TYPE_FP32
    dims: [-1, 1]
}

To create model.pt we use the following script:

import os
import torch
from transformers import RobertaForSequenceClassification
from transformers import RobertaTokenizerFast

TS_MODEL_PATH = '/ml_serving/models/ts/1/1/model.pt'
os.makedirs(os.path.dirname(TS_MODEL_PATH), exist_ok=True)

# Wrapper so the traced module returns only the logits tensor.
class pyTorchToTorchScript(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = RobertaForSequenceClassification.from_pretrained('roberta-base', torchscript=True)

    def forward(self, *arg, **kwargs):
        x = self.model(*arg, **kwargs)
        return x[0]

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = pyTorchToTorchScript()

sentences = ['Hello world!', 'Another simple sentence.']
model.eval()
toks = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=200)
output = model(toks['input_ids'], toks['attention_mask'])
print('original:', output)

# Trace the wrapper to TorchScript, sanity-check its output, and save it.
ts_model = torch.jit.trace(model, (toks['input_ids'], toks['attention_mask']), strict=False)
print('torch:', ts_model(toks['input_ids'], toks['attention_mask']))
ts_model.save(TS_MODEL_PATH)
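
For reference, the ONNX variant that does not exhibit the leak can be produced along these lines (a sketch only; the /ml_serving/models/onnx path and opset choice are illustrative):

# Illustrative ONNX export of the same wrapped RoBERTa model.
import os
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

ONNX_MODEL_PATH = '/ml_serving/models/onnx/1/1/model.onnx'  # assumed layout
os.makedirs(os.path.dirname(ONNX_MODEL_PATH), exist_ok=True)

class RobertaLogits(torch.nn.Module):
    # Same idea as the wrapper above: return only the logits tensor.
    def __init__(self):
        super().__init__()
        self.model = RobertaForSequenceClassification.from_pretrained('roberta-base', torchscript=True)

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids, attention_mask)[0]

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = RobertaLogits().eval()
toks = tokenizer(['Hello world!', 'Another simple sentence.'],
                 return_tensors='pt', padding=True, truncation=True, max_length=200)

# Dynamic batch and sequence axes to match dims: [-1, -1] in config.pbtxt.
torch.onnx.export(
    model,
    (toks['input_ids'], toks['attention_mask']),
    ONNX_MODEL_PATH,
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={'input_ids': {0: 'batch', 1: 'seq'},
                  'attention_mask': {0: 'batch', 1: 'seq'},
                  'logits': {0: 'batch'}},
    opset_version=14,
)

The ONNX copy of the model is then served with platform: "onnxruntime_onnx" in its config.pbtxt; otherwise the configuration is the same as above.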

v2_input.json:

{"inputs":[
    {"id":10,"name":"input_ids","shape":[10,17],"datatype":"INT64","data":[[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1]]},
    {"id":11,"name":"attention_mask","shape":[10,17],"datatype":"INT64","data":[[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0]]}]
}

Expected behavior The Triton server should release all of the memory allocated for a model after it is unloaded. In the long run, memory utilization is expected to be stable.

igrinis avatar May 23 '23 15:05 igrinis

Thank you for the detailed bug report. We've filed a ticket to investigate.

dyastremsky avatar May 23 '23 23:05 dyastremsky

Hey guys! Any progress on the issue?

igrinis avatar Jun 18 '23 14:06 igrinis

Hi, I am also facing the issue and looking for a solution.

stefan-ax avatar Jun 19 '23 13:06 stefan-ax

Thank you for letting us know. This is still in our queue. We'll investigate soon.

dyastremsky avatar Jun 20 '23 17:06 dyastremsky

I am encountering this as well.

thortom avatar Aug 30 '23 10:08 thortom

Thank you for letting us know, Thor.

As an update, we are able to reproduce this on our end as well and have been actively working on it. This was introduced in 22.12 after some changes in the PyTorch upstream that month. It should have been caught by our testing then but was not. We fixed the related tests. We are working with PyTorch folks to provide a fix as soon as we can.

dyastremsky avatar Aug 30 '23 16:08 dyastremsky

Facing the same issue. Any updates?

bruce99kang avatar Dec 04 '23 07:12 bruce99kang

Hello, is there a solution or update?

kadmor avatar Jan 17 '24 05:01 kadmor

Not yet. We are working on a standalone PyTorch reproducer to try to identify the source of the memory growth.
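
A minimal sketch of such a standalone loop (assuming the model.pt produced by the export script above, with illustrative random inputs) would be something like:

# Illustrative standalone cycle: load the TorchScript module on the GPU, run one
# inference, drop it, and report CUDA memory each iteration.
import torch

MODEL_PATH = '/ml_serving/models/ts/1/1/model.pt'  # path from the export script above

input_ids = torch.randint(0, 50265, (10, 17), dtype=torch.int64, device='cuda')
attention_mask = torch.ones((10, 17), dtype=torch.int64, device='cuda')

for i in range(100):
    model = torch.jit.load(MODEL_PATH, map_location='cuda')
    with torch.no_grad():
        model(input_ids, attention_mask)
    del model
    torch.cuda.empty_cache()
    print(f'iter {i}: allocated={torch.cuda.memory_allocated() / 2**20:.1f} MiB, '
          f'reserved={torch.cuda.memory_reserved() / 2**20:.1f} MiB')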

dyastremsky avatar Feb 22 '24 19:02 dyastremsky

Hello, I'm facing the same issue. Any updates?

Kokkini avatar Jun 24 '24 03:06 Kokkini

Not yet. We do not yet have a reproducer isolated to PyTorch or a root cause identified on the Triton side.

Ref: DLIS-4941. CC: @krishung5

dyastremsky avatar Jun 24 '24 16:06 dyastremsky