Allocating a TensorRT model on the GPU consumes all available memory
Description
When I allocate a TensorRT model on the GPU, it consumes all available memory on the device. I have tried instantiating the model both on an empty GPU and alongside another model already on the GPU; both work, but in each case CUDA memory fills completely. This suggests that TensorRT is consuming all available memory on the GPU.
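One way to confirm this is to compare device memory before and after loading the model, e.g. with NVML (a minimal sketch; assumes the nvidia-ml-py / pynvml package is installed):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

def used_mib():
    # Query total/used/free device memory via NVML
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return info.used / 1024**2

print(f"before load: {used_mib():.0f} MiB used")
# ... load the TensorRT model here ...
print(f"after load: {used_mib():.0f} MiB used")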
Environment
TensorRT Version: 8.4.1.5
NVIDIA GPU: Tesla V100
NVIDIA Driver Version: 470.57.02
CUDA Version: 11.7
CUDNN Version:
Operating System: Ubuntu
Python Version (if applicable): 3.8
PyTorch Version (if applicable): 1.12.0+cu102
Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:22.06-py3, but pip install nvidia-tensorrt==8.4.1.5
Steps To Reproduce
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.modeling_ort import ORTModelForCausalLM
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('distilgpt2')
tokenizer = AutoTokenizer.from_pretrained('distilgpt2', padding=True, truncation=True)

# TensorRT-compatible dynamic quantization config
qconfig = AutoQuantizationConfig.tensorrt(is_static=False, per_channel=False)
quantizer = ORTQuantizer(model=model, preprocessor=tokenizer, feature="causal-lm")

# Export the model to ONNX and quantize it
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    quantization_config=qconfig,
    use_external_data_format=True,
)

# Load the quantized model with the TensorRT execution provider
tensorrt_model = ORTModelForCausalLM.load_model("./model-quantized.onnx", provider="TensorrtExecutionProvider")
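In case it's relevant: the same load path expressed with plain onnxruntime, including the provider option that caps the workspace the TensorRT execution provider may allocate (a minimal sketch; the 2 GiB value is illustrative):

import onnxruntime as ort

# Cap the TensorRT workspace instead of letting it grow to the full device
providers = [
    ("TensorrtExecutionProvider", {"trt_max_workspace_size": 2 * 1024**3}),
    "CUDAExecutionProvider",
]
session = ort.InferenceSession("model-quantized.onnx", providers=providers)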
I don't see TensorRT involved in your Steps To Reproduce; this looks more like an issue with optimum or transformers.
BTW, can you try using trtexec to convert your model? e.g. trtexec --onnx=model-quantized.onnx --int8 --fp16 --saveEngine=model-quantized.onnx.plan
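If you prefer doing the conversion from Python, the rough equivalent with the TensorRT Python API looks like this (a sketch against the 8.4 API; the paths and the 2 GiB workspace cap are illustrative):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse the quantized ONNX model
parser = trt.OnnxParser(network, logger)
with open("model-quantized.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

# Builder config: enable INT8/FP16 and cap the builder workspace,
# analogous to --int8 --fp16 on trtexec
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)

engine = builder.build_serialized_network(network, config)
with open("model-quantized.onnx.plan", "wb") as f:
    f.write(engine)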
Closing since there has been no activity for more than 3 weeks; please reopen if you still have questions, thanks!