Allocating a TensorRT model on the GPU consumes all available memory
Description
When I allocate a TensorRT model on the GPU, it consumes all available memory on the device. I have tried instantiating the model both on an empty GPU and alongside another model already on the GPU; both work, but in each case CUDA memory fills completely. This suggests that TensorRT is consuming all available memory on the GPU.
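One way to confirm this is to compare device memory before and after loading the model, e.g. with NVML (a minimal sketch; assumes the nvidia-ml-py / pynvml package is installed):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

def used_mib():
    # Query total/used/free device memory via NVML
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return info.used / 1024**2

print(f"before load: {used_mib():.0f} MiB used")
# ... load the TensorRT model here ...
print(f"after load: {used_mib():.0f} MiB used")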
Environment
TensorRT Version: 8.4.1.5
NVIDIA GPU: Tesla V100
NVIDIA Driver Version: 470.57.02
CUDA Version: 11.7
CUDNN Version:
Operating System: Ubuntu
Python Version (if applicable): 3.8
PyTorch Version (if applicable): 1.12.0+cu102
Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:22.06-py3, but pip install nvidia-tensorrt==8.4.1.5
Steps To Reproduce
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.modeling_ort import ORTModelForCausalLM
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('distilgpt2')
tokenizer = AutoTokenizer.from_pretrained('distilgpt2', padding=True, truncation=True)

# TensorRT-compatible dynamic quantization config
qconfig = AutoQuantizationConfig.tensorrt(is_static=False, per_channel=False)
quantizer = ORTQuantizer(model=model, preprocessor=tokenizer, feature="causal-lm")

# Export the model to ONNX and quantize it
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    quantization_config=qconfig,
    use_external_data_format=True,
)

# Load the quantized model with the TensorRT execution provider
tensorrt_model = ORTModelForCausalLM.load_model("./model-quantized.onnx", provider="TensorrtExecutionProvider")
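In case it's relevant: the same load path expressed with plain onnxruntime, including the provider option that caps the workspace the TensorRT execution provider may allocate (a minimal sketch; the 2 GiB value is illustrative):

import onnxruntime as ort

# Cap the TensorRT workspace instead of letting it grow to the full device
providers = [
    ("TensorrtExecutionProvider", {"trt_max_workspace_size": 2 * 1024**3}),
    "CUDAExecutionProvider",
]
session = ort.InferenceSession("model-quantized.onnx", providers=providers)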
I don't see TensorRT involved in your Steps To Reproduce; this looks more like an issue with optimum or transformers.
BTW, can you try using trtexec to convert your model? e.g. trtexec --onnx=model-quantized.onnx --int8 --fp16 --saveEngine=model-quantized.onnx.plan
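If you prefer doing the conversion from Python, the rough equivalent with the TensorRT Python API looks like this (a sketch against the 8.4 API; the paths and the 2 GiB workspace cap are illustrative):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse the quantized ONNX model
parser = trt.OnnxParser(network, logger)
with open("model-quantized.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

# Builder config: enable INT8/FP16 and cap the builder workspace,
# analogous to --int8 --fp16 on trtexec
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)

engine = builder.build_serialized_network(network, config)
with open("model-quantized.onnx.plan", "wb") as f:
    f.write(engine)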
Closing since there has been no activity for more than 3 weeks; please reopen if you still have questions, thanks!