Mixed precision(FP16+FP32) engine's memory size

Open oreo-lp opened this issue 3 years ago • 10 comments

Description

When I build a mixed-precision (FP16+FP32) engine, I find that its memory size is very close to that of the pure FP32 engine. Here are the configurations in which I forced ops to run in FP32:

(1) Full FP32 engine: memory size is 2.28 GB, inference time is 41 ms.
(2) Only Pow ops in FP32 (49 Pow ops): memory size is 1.14 GB, inference time is 40 ms.
(3) Only Pow and MatMul ops in FP32 (49 Pow ops, 2 MatMul ops): memory size is 2.27 GB, inference time is 40 ms.

I am very confused by (3): after adding just two MatMul ops to FP32, why are the memory size and inference time so close to the full FP32 engine? Has the mixed-precision engine effectively fallen back to FP32, and if so, why?

This model is wav2vec's transformer.

Environment

TensorRT Version: 8.4
NVIDIA GPU: T4
NVIDIA Driver Version:
CUDA Version: 10.2
CUDNN Version:
Operating System: Linux
Python Version (if applicable): 3.7
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

oreo-lp avatar Aug 10 '22 07:08 oreo-lp

One question here: how do you compute the memory size?

In the TRT verbose log there will be memory info about the engine. e.g.

[08/09/2022-22:08:24] [I] Engine built in 150.317 sec.
[08/09/2022-22:08:24] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1592, GPU 9538 (MiB)
[08/09/2022-22:08:24] [I] [TRT] Loaded engine size: 25 MiB
[08/09/2022-22:08:24] [V] [TRT] Deserialization required 18246 microseconds.
[08/09/2022-22:08:24] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +24, now: CPU 0, GPU 24 (MiB)
[08/09/2022-22:08:24] [I] Engine deserialized in 0.0211171 sec.
[08/09/2022-22:08:24] [V] [TRT] Total per-runner device persistent memory is 0
[08/09/2022-22:08:24] [V] [TRT] Total per-runner host persistent memory is 109888
[08/09/2022-22:08:24] [V] [TRT] Allocated activation device memory of size 8028160
[08/09/2022-22:08:24] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +8, now: CPU 0, GPU 32 (MiB)
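
If the engine is built from Python rather than with trtexec, a rough cross-check of these numbers is to compare the plan-file size with the activation memory the engine reports. Below is a minimal sketch, assuming a TensorRT 8.x Python install and a hypothetical engine file name model_fp32.plan:

```python
# Minimal sketch (hypothetical file "model_fp32.plan"): compare the serialized
# engine size on disk with the activation memory TensorRT will allocate.
import os
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)

plan_path = "model_fp32.plan"
with open(plan_path, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Size of the serialized plan (dominated by the weights).
print("serialized engine size:   %.1f MiB" % (os.path.getsize(plan_path) / 2**20))

# Device memory TensorRT needs for activations at execution time.
print("activation device memory: %.1f MiB" % (engine.device_memory_size / 2**20))
```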

zerollzeng avatar Aug 10 '22 14:08 zerollzeng

One question here: how do you compute the memory size?

I measured it on the hardware, not from the verbose log. Aren't they the same thing? And what about the inference time?

oreo-lp avatar Aug 10 '22 14:08 oreo-lp

How can I make sure that the other ops run in FP16? How can I get the data type of each layer in the engine?

oreo-lp avatar Aug 10 '22 15:08 oreo-lp

You can see it in the verbose log. Try searching for "Engine Layer Information".
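
For a build driven from the Python API instead of trtexec, the same information shows up if the builder is created with a VERBOSE logger. A minimal sketch, assuming TensorRT 8.x and a hypothetical model.onnx:

```python
# Minimal sketch: build with a VERBOSE logger so the build log contains the
# per-layer precision/tactic details ("Engine Layer Information").
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)            # verbose log goes to stdout
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:                # hypothetical model path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)              # allow FP16 kernels

# Search the emitted log for "Engine Layer Information" to see which precision
# TensorRT finally chose for each (fused) layer.
serialized_engine = builder.build_serialized_network(network, config)
```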

zerollzeng avatar Aug 10 '22 15:08 zerollzeng

I measured it on the hardware, not from the verbose log. Aren't they the same thing?

Hardware memory usage usually also includes other modules such as cuBLAS and cuDNN, so they are not the same thing.
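
As an illustration (not from the thread itself), one way to see the gap is to compare the device-memory growth of the whole process with the engine's own activation allocation. The sketch below assumes the pynvml package and a hypothetical model.plan file:

```python
# Minimal sketch: "hardware" memory growth (CUDA context, cuBLAS/cuDNN, engine,
# activations, ...) vs. the engine's own activation allocation.
import pynvml
import tensorrt as trt

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mib():
    # Whole-device usage, i.e. what nvidia-smi reports (includes everything).
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20

before = used_mib()

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)
with open("model.plan", "rb") as f:                # hypothetical engine file
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

after = used_mib()
print("device memory grew by:    %.1f MiB" % (after - before))
print("engine activation memory: %.1f MiB" % (engine.device_memory_size / 2**20))
```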

And what about the inference time?

Using mixed precision may introduce extra data-reformat overhead between layers of different precisions. The reformat layers can be seen in the verbose log or in the per-layer profile. I would suggest using trtexec as a reference; it works out of the box.
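
The per-layer profile is also available from the Python API via an IProfiler, which makes the inserted reformat layers visible by name. A rough sketch (the engine/bindings setup is assumed and not shown):

```python
# Minimal sketch of a per-layer profiler; reformat layers between FP16 and
# FP32 regions show up as their own entries in the timing report.
import tensorrt as trt

class LayerTimer(trt.IProfiler):
    def __init__(self):
        super().__init__()
        self.times = {}

    def report_layer_time(self, layer_name, ms):
        # Called by TensorRT once per layer for every profiled execution.
        self.times[layer_name] = self.times.get(layer_name, 0.0) + ms

# Usage (engine and bindings assumed to exist already):
#   timer = LayerTimer()
#   context = engine.create_execution_context()
#   context.profiler = timer
#   context.execute_v2(bindings)
#   for name, ms in sorted(timer.times.items(), key=lambda kv: -kv[1]):
#       print(f"{ms:8.3f} ms  {name}")
```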

zerollzeng avatar Aug 10 '22 15:08 zerollzeng

Is there any way to get the data type (FP16 or FP32) of the layers in a mixed-precision engine during inference?

oreo-lp avatar Aug 10 '22 15:08 oreo-lp

I have checked the verbose information: (1)'s engine size is 2334 MiB (~2.28 GB) and (3)'s engine size is 2323 MiB (~2.27 GB). Why are they so close? Have all of (3)'s layers turned into FP32?

oreo-lp avatar Aug 11 '22 00:08 oreo-lp

Is there any way to get the data type (FP16 or FP32) of the layers in a mixed-precision engine during inference?

No, it's only logged in the build phase.

I have checked the verbose information: (1)'s engine size is 2334 MiB (~2.28 GB) and (3)'s engine size is 2323 MiB (~2.27 GB). Why are they so close?

It mainly depends on the weights. For example, Pow has no parameters, so it won't affect the serialized engine size whether it runs in FP32 or FP16. The same applies to MatMul when both inputs are tensors. In contrast, a conv layer does shrink when using FP16, e.g. compare the sizes produced by trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx --saveEngine=fp32.plan and trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx --fp16 --saveEngine=fp16.plan: one is 99 MB while the other is only 50 MB.

Have all of (3)'s layers turned into FP32?

I don't think so, but TRT may use FP32 for some layers even if you specify FP16 for them; it depends on which is faster. You can force the precision with the trtexec --precisionConstraints option; we also have an API for it, please check the API doc.

zerollzeng avatar Aug 11 '22 13:08 zerollzeng

Thanks! By the way, what is the Python API that corresponds to trtexec --precisionConstraints? I use Python to convert ONNX to an engine.

oreo-lp avatar Aug 11 '22 13:08 oreo-lp

https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/BuilderConfig.html#tensorrt.BuilderFlag
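
A minimal sketch of what this looks like in Python, assuming a hypothetical model.onnx and TensorRT 8.2 or newer (the per-layer selection rule below is only an example):

```python
# Minimal sketch: Python equivalent of trtexec --precisionConstraints, i.e. set
# a builder flag and pin the precision of individual layers.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:                # hypothetical model path
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# OBEY fails the build if a constraint cannot be met; PREFER only warns.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    # Example rule only: keep elementwise (e.g. Pow) and MatMul layers in FP32.
    if layer.type in (trt.LayerType.ELEMENTWISE, trt.LayerType.MATRIX_MULTIPLY):
        layer.precision = trt.float32
        layer.set_output_type(0, trt.float32)

engine_bytes = builder.build_serialized_network(network, config)
```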

zerollzeng avatar Aug 12 '22 02:08 zerollzeng

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!

ttyio avatar Dec 06 '22 02:12 ttyio