GPU memory leak when loading/unloading models
Description
When cycling through the load model -> infer -> unload model scenario, we observe a GPU memory leak.
This only happens when the models are in TorchScript format; there is no leak if the same models are converted to ONNX. Memory usage is also stable when no inference is requested (i.e. only cycling through loading and unloading models).
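For reference, one way to produce an equivalent ONNX model for the comparison looks roughly like the sketch below (the paths and opset version are illustrative, not necessarily the exact export settings we used):

import os
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

ONNX_MODEL_PATH = '/ml_serving/models/onnx/1/1/model.onnx'  # illustrative path
os.makedirs(os.path.dirname(ONNX_MODEL_PATH), exist_ok=True)

model = RobertaForSequenceClassification.from_pretrained('roberta-base', torchscript=True)
model.eval()
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
toks = tokenizer(['Hello world!'], return_tensors='pt')

# Export with dynamic batch and sequence dimensions so the Triton config
# can keep dims [-1, -1] for both inputs.
torch.onnx.export(
    model,
    (toks['input_ids'], toks['attention_mask']),
    ONNX_MODEL_PATH,
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={
        'input_ids': {0: 'batch', 1: 'sequence'},
        'attention_mask': {0: 'batch', 1: 'sequence'},
        'logits': {0: 'batch'},
    },
    opset_version=14,  # illustrative; any recent opset should work
)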
Triton Information
Are you using the Triton container or did you build it yourself?
Tested with NVIDIA's tritonserver:23.01-py3 and tritonserver:23.04-py3 Docker images.
To Reproduce
Start the Triton server with the --model-control-mode=explicit flag:
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m -v $PWD:/models nvcr.io/nvidia/tritonserver:23.04-py3 bash -c "tritonserver --model-repository=/models --model-control-mode=explicit --disable-auto-complete-config"
Run a script that loads the model, runs inference, and unloads it a few dozen times:
for i in $(seq 1 100); do curl -XPOST http://127.0.0.1:8000/v2/repository/models/1/load; curl -XPOST 127.0.0.1:8000/v2/models/1/infer -H 'Content-Type: application/json' -d @/ml_serving/v2_input.json; curl -XPOST http://127.0.0.1:8000/v2/repository/models/1/unload; echo $i; done
Every ~50 cycles we lose about 1 GB of GPU memory.
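For convenience, here is a rough Python equivalent of the shell loop (a sketch assuming the tritonclient[http] and pynvml pip packages, which are not part of the original repro); it also samples GPU memory after each unload so the growth is visible without watching nvidia-smi:

import time

import numpy as np
import pynvml
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url='127.0.0.1:8000')
pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

# Same tensor shapes as in v2_input.json: batch of 10, sequence length 17.
input_ids = np.ones((10, 17), dtype=np.int64)
attention_mask = np.ones((10, 17), dtype=np.int64)

for i in range(100):
    client.load_model('1')

    inputs = [
        httpclient.InferInput('input_ids', list(input_ids.shape), 'INT64'),
        httpclient.InferInput('attention_mask', list(attention_mask.shape), 'INT64'),
    ]
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(attention_mask)
    client.infer('1', inputs)

    client.unload_model('1')
    time.sleep(1)  # unloading is asynchronous; give it a moment to finish

    used_mib = pynvml.nvmlDeviceGetMemoryInfo(gpu).used / 2**20
    print(f'cycle {i}: {used_mib:.0f} MiB of GPU memory in use')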
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
We use a standard RoBERTa model for a classification task.
config.pbtxt:
name: "1"
platform: "pytorch_libtorch"
default_model_filename: "model.pt"
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [-1, -1]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [-1, -1]
  }
]
output {
  name: "logits"
  data_type: TYPE_FP32
  dims: [-1, 1]
}
To create model.pt, we use the following script:
import os
import torch
from transformers import RobertaForSequenceClassification
from transformers import RobertaTokenizerFast

TS_MODEL_PATH = '/ml_serving/models/ts/1/1/model.pt'
os.makedirs(os.path.dirname(TS_MODEL_PATH), exist_ok=True)

class pyTorchToTorchScript(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = RobertaForSequenceClassification.from_pretrained('roberta-base', torchscript=True)

    def forward(self, *arg, **kwargs):
        x = self.model(*arg, **kwargs)
        return x[0]

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = pyTorchToTorchScript()
sentences = ['Hello world!', 'Another simple sentence.']
model.eval()
toks = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=200)
output = model(toks['input_ids'], toks['attention_mask'])
print('original:', output)
ts_model = torch.jit.trace(model, (toks['input_ids'], toks['attention_mask']), strict=False)
print('torch:', ts_model(toks['input_ids'], toks['attention_mask']))
ts_model.save(TS_MODEL_PATH)
v2_input.json:
{"inputs":[
{"id":10,"name":"input_ids","shape":[10,17],"datatype":"INT64","data":[[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1]]},
{"id":11,"name":"attention_mask","shape":[10,17],"datatype":"INT64","data":[[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0]]}]
}
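For reference, a payload with this structure can be generated from the tokenizer; a small sketch is below (the sentences and output file name are illustrative, not the exact text behind the token IDs above):

import json
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
# Any batch of sentences padded to a common length works; these are illustrative.
sentences = ['Hello world!', 'Another simple sentence.'] * 5

toks = tokenizer(sentences, return_tensors='np', padding=True)
payload = {
    'inputs': [
        {'name': 'input_ids', 'shape': list(toks['input_ids'].shape),
         'datatype': 'INT64', 'data': toks['input_ids'].tolist()},
        {'name': 'attention_mask', 'shape': list(toks['attention_mask'].shape),
         'datatype': 'INT64', 'data': toks['attention_mask'].tolist()},
    ]
}
with open('v2_input.json', 'w') as f:
    json.dump(payload, f)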
Expected behavior
The Triton server should release all memory that was allocated for a model after it is unloaded. In the long run, memory utilization is expected to be stable.
Thank you for the detailed bug report. We've filed a ticket to investigate.
Hey guys! Any progress on the issue?
Hi, I am also facing the issue and looking for a solution.
Thank you for letting us know. This is still in our queue. We'll investigate soon.
I am encountering this as well.
Thank you for letting us know, Thor.
As an update, we are able to reproduce this on our end as well and have been actively working on it. The issue was introduced in 22.12 after some changes in upstream PyTorch that month. It should have been caught by our testing at the time but was not; we have since fixed the related tests. We are working with the PyTorch folks to provide a fix as soon as we can.
Facing the same issue. Any updates?
Hello, is there a solution or update?
Not yet. We are working on a reproducer running within PyTorch standalone to try to identify the source of the memory growth.
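For anyone who wants to experiment in the meantime, the rough shape of such a reproducer is sketched below (not our internal test; the path and shapes follow the original report): load the traced model onto the GPU, run one inference, delete it, empty the CUDA cache, and check whether device memory returns to the baseline.

import pynvml
import torch

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mib():
    # Device-wide memory in use, as reported by NVML (same source as nvidia-smi).
    return pynvml.nvmlDeviceGetMemoryInfo(gpu).used / 2**20

MODEL_PATH = '/ml_serving/models/ts/1/1/model.pt'  # the traced model from the report
input_ids = torch.ones(10, 17, dtype=torch.int64, device='cuda')
attention_mask = torch.ones(10, 17, dtype=torch.int64, device='cuda')

baseline = used_mib()
for i in range(100):
    model = torch.jit.load(MODEL_PATH, map_location='cuda')
    with torch.no_grad():
        model(input_ids, attention_mask)
    del model
    torch.cuda.synchronize()
    torch.cuda.empty_cache()
    print(f'cycle {i}: {used_mib() - baseline:+.0f} MiB vs baseline')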
Hello, I'm facing the same issue. Any updates?
Not yet. We do not yet have a reproducer isolated to PyTorch or a root cause identified on the Triton side.
Ref: DLIS-4941. CC: @krishung5