No improvement in GPU memory consumption during inference

Open vedanshthakkar opened this issue 3 years ago • 3 comments

I have converted the Matterport implementation of Mask R-CNN from a saved model to a 16-bit TRT-optimized saved model. I can see a 100ms improvement in inference time; however, I do not see any reduction in GPU memory consumption. Given that the original model is a 32-bit model and the optimized model is a 16-bit model, I was expecting some reduction in GPU memory consumption during inference.

I used: TensorFlow 2.10.0, TensorRT 7.2.2.1, Colab Pro+.

No one seems to discuss GPU memory consumption after optimization. Is it only the inference time that is improved by TF-TRT?

vedanshthakkar avatar Oct 17 '22 01:10 vedanshthakkar

In general, TF-TRT focuses on inference performance, and unfortunately memory consumption is rarely improved. If memory footprint is critical for your application, standalone TensorRT does a much better job at memory reduction than TF-TRT.
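For readers exploring the standalone-TensorRT route mentioned above, one common path is exporting the model to ONNX and building an FP16 engine with `trtexec`. This is only a sketch; the file names are placeholders, and a Mask R-CNN export to ONNX typically needs extra work (custom plugins for ops like ROIAlign):

```shell
# Hypothetical paths; assumes the model has already been exported to ONNX.
trtexec --onnx=mask_rcnn.onnx \
        --fp16 \
        --saveEngine=mask_rcnn_fp16.engine
```

The resulting engine holds FP16 weights natively and is loaded by the TensorRT runtime directly, without the TensorFlow runtime's memory overhead.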

CC: @pjannaty for memory consumption issue.

ncomly-nvidia avatar Oct 17 '22 15:10 ncomly-nvidia

@ncomly-nvidia @pjannaty Understood. However, if I am optimizing a 32-bit model with precision_mode='FP16' as one of the conversion parameters, my understanding is that the weights of the converted/optimized model should be FP16, in which case the model should take roughly half the memory during inference. Am I missing something?
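The halving intuition is sound for the weight storage itself, as a minimal NumPy sketch (just the arithmetic, not TF-TRT) shows:

```python
import numpy as np

# Toy stand-in for one layer's weights; real Mask R-CNN weights are far larger.
w32 = np.random.rand(1024, 1024).astype(np.float32)
w16 = w32.astype(np.float16)

print(w32.nbytes)  # 4194304 bytes: 1024*1024 values * 4 bytes each
print(w16.nbytes)  # 2097152 bytes: 1024*1024 values * 2 bytes each
```

One commonly cited reason the observed GPU usage may not drop anyway: the TF-TRT converted graph still runs inside the TensorFlow runtime, which may keep the original FP32 variables resident as a native-TF fallback, on top of TensorRT's own workspace allocation.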

vedanshthakkar avatar Oct 19 '22 15:10 vedanshthakkar

It's hard to tell why TRT does not show a memory usage reduction here. We do have an experimental PR that you may want to try at your discretion to see if it helps with the issue: https://github.com/tensorflow/tensorflow/pull/55959

pjannaty avatar Oct 20 '22 00:10 pjannaty