How to convert an AMP-trained model to get the best performance and speed?
According to the doc https://docs.pytorch.org/TensorRT/user_guide/mixed_precision.html, we can convert a model with this project when the parameter precisions are explicitly specified in the code. But when I train a model with torch AMP GradScaler, where no value precision is tagged in the model code, can we use this method to get a converted checkpoint with the best performance and inference speedup?
In fact, we had tried the torch pt -> onnx -> tensorrt fp16 pipeline to convert a PyTorch AMP-trained checkpoint into TRT model format, but the inference results are noisy, while the pt -> onnx -> tensorrt fp32 pipeline gives a TRT fp32 model whose inference is slower than what we need.
I'm not sure I fully understand your question, and I also don't know in depth what ONNX-TensorRT does to support AMP graphs. For Torch-TRT, there are two modes you can compile with: implicit typing (the default) and explicit typing. Implicit typing ignores things like AMP or .to(dtype=...) casts and lets TensorRT auto-tune precision for best performance. Explicit typing preserves the types you specified yourself, such as those that come from AMP (after lowering).

GradScaler, IIRC, is for the backward pass, whereas TRT only cares about the forward pass, so the relevant information is what you get out of torch.amp.autocast. If your code does not preserve these types, then I would recommend implicit typing. Note: in the future, only explicit typing will be supported.
Here is an example:

```python
import torch
import torch_tensorrt


class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a_float32 = torch.rand((2, 2), device="cuda")
        self.b_float32 = torch.rand((2, 2), device="cuda")
        self.d_float32 = torch.rand((2, 2), device="cuda")

    def forward(self, x, y):
        # USE AUTOCAST TO SELECT OPERATIONS WHICH SHOULD RUN IN FP16
        with torch.autocast(x.device.type, enabled=True):
            e_float16 = torch.mm(self.a_float32, self.b_float32)
            f_float16 = torch.mm(self.d_float32, e_float16)
            g_float16 = torch.mm(f_float16, x)
            h_float16 = torch.add(g_float16, y)
        return h_float16


module = MyModule().to("cuda")
inp = (torch.tensor([[0.5, 0.5]], device="cuda").T, torch.tensor([0.5], device="cuda"))

ep = torch.export.export(module, inp)
out = ep.module()(*inp)
print(ep)

with torch_tensorrt.dynamo.Debugger(
    "graphs",
    logging_dir=".",
    capture_fx_graph_before=[torch_tensorrt.dynamo.lowering.ATEN_PRE_LOWERING_PASSES.passes[0].__name__],
    capture_fx_graph_after=[torch_tensorrt.dynamo.lowering.ATEN_POST_LOWERING_PASSES.passes[-1].__name__],
    save_engine_profile=True,
    profile_format="trex",
    engine_builder_monitor=False,
):
    # HERE WE COMPILE THE MODEL WITH TENSORRT BUT USE AMP'S TYPES
    trt_mod = torch_tensorrt.compile(ep.module(), arg_inputs=inp, use_explicit_typing=True, min_block_size=1)
    print(trt_mod)
    out_trt = trt_mod(*inp)
    print(out - out_trt)
```
Here you can see that, before the PyTorch graph gets lowered, there is an autocast subgraph in fp16 and an output in fp32.
After Torch-TRT lowering, this gets simplified to a graph where constants are folded and some operators run in FP16 while others run in FP32. With use_explicit_typing=True these types are reflected directly in the TRT engine, with cast layers inserted according to PyTorch type promotion rules.
TRT and PyTorch don't use identical numerics, which might be the cause of your inference instability, but in practice we don't see that affecting model performance (if you do, file a bug with https://github.com/NVIDIA/TensorRT). In Torch-TRT specifically we do things like use_fp32_acc=True to keep numerics closer to PyTorch. I'm unsure what ONNX does here.
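For example, a minimal sketch of that option reusing `ep` and `inp` from the example above (keyword support may vary across Torch-TensorRT releases, so treat this as illustrative rather than exact):

```python
# Sketch only: keep FP16 matmul accumulation in FP32 to stay numerically
# closer to PyTorch. use_fp32_acc is typically combined with explicit typing.
trt_mod_acc = torch_tensorrt.compile(
    ep.module(),
    arg_inputs=inp,
    use_explicit_typing=True,  # preserve the AMP-selected dtypes
    use_fp32_acc=True,         # accumulate FP16 matmuls in FP32
    min_block_size=1,
)
print(out - trt_mod_acc(*inp))  # compare against the eager reference output
```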
Hi @JohnHerry, I have some questions:

> when I train a model with torch AMP GradScaler where no value precision is tagged in the model code
To my best knowledge, AMP training uses torch.autocast and torch.amp.GradScaler together. If you use torch.autocast, PyTorch will convert some ops from fp32 to fp16 and keep others in fp32, per the PyTorch AMP doc. That said, the precision of each op is tagged in the model. Can you share your minimal model, code, and expectations so that I can investigate for you?
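For reference, the per-op tagging happens in the forward pass via autocast; GradScaler only scales gradients in the backward pass. A minimal sketch of a typical AMP training loop (placeholder model/optimizer, recent PyTorch assumed; older versions use torch.cuda.amp.GradScaler):

```python
import torch

# Placeholder model and data, purely for illustration.
model = torch.nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
scaler = torch.amp.GradScaler("cuda")

x = torch.randn(8, 16, device="cuda")
y = torch.randn(8, 1, device="cuda")

for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    # autocast decides which forward ops run in fp16 vs fp32 (the "tagging")
    with torch.autocast("cuda"):
        loss = loss_fn(model(x), y)
    # GradScaler only scales the backward pass; it does not change forward dtypes
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```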
> we had tried the torch pt -> onnx -> tensorrt fp16 pipeline to convert a PyTorch AMP-trained checkpoint into TRT model format, but the inference results are noisy
How did you do the pt -> onnx -> tensorrt fp16 pipeline? Did you use NVIDIA ModelOpt AutoCast to convert the ONNX model from fp32 to fp16 and then build the fp16 ONNX model into a TRT engine?
The model and code can be found there. The flow model can be trained in AMP mode.
As to the fp32-to-fp16 pipeline, the tool script is there. export_onnx.py converts the PyTorch model into ONNX format, which is still fp32, while export_trt.sh turns the ONNX model into TRT format; in fact it calls the trtexec command with the --fp16 option directly.
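Roughly, the export step boils down to something like the following sketch (the model, shapes, and file names here are placeholders, not the actual export_onnx.py):

```python
import torch

# Hypothetical stand-in for the flow decoder estimator; the real export_onnx.py
# loads the trained checkpoint instead of building a dummy module.
model = torch.nn.Linear(16, 16).eval().cuda()
dummy = torch.randn(1, 16, device="cuda")

# Weights stay fp32 here even for an AMP-trained checkpoint; the fp16 decision
# is deferred to the TRT build step (trtexec --fp16 in export_trt.sh).
torch.onnx.export(
    model,
    (dummy,),
    "estimator.fp32.onnx",
    input_names=["x"],
    output_names=["y"],
    opset_version=17,
)
```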
We had tried to train the full-precision flow model and use those tools to convert it to TRT, but the TRT model inference is not stable and randomly generates NaN values. Then we tried AMP training mode to let the model fit half-precision parameters, but this pipeline also sometimes fails to produce a stable TRT model. Sometimes the training checkpoint can be converted to TRT successfully, sometimes not.
@zewenli98 Hi zewen, is there any difference between converting the fp32 ONNX model to an fp16 TensorRT model with trtexec directly, versus adding a step that converts the fp32 ONNX to mixed-precision ONNX with your suggested modelopt.autocast?
In our practice with the pipeline pytorch AMP -> onnx fp32 -> trt fp16, the step pytorch AMP -> onnx fp32 can be executed on the same environment as the model training machine, but the step onnx fp32 -> trt fp16 must be executed on the machine where we want to deploy the model. I do not know how to use modelopt.autocast: should it be executed on the training machine or the final deploy machine, if the two machines have different cards and CUDA versions?
@JohnHerry There are two modes for building TRT engines:
- Strong typing. In short, it means users have to specify the precision of each op. TRT will respect those precisions while building the engine. You can do it with `trtexec --stronglyTyped ...`. The whole workflow should be something like: pytorch model -> original ONNX --modelopt.autocast--> mix-precision ONNX -> mix-precision TRT engine, where the `original ONNX --modelopt.autocast--> mix-precision ONNX` step should happen on the training machine.
- Weak typing. In short, users don't have to specify the precision of each op; TRT will select the best allowed precision for them. For example, in your case, `trtexec --fp16 ...` uses weak typing, which allows the default fp32 plus the fp16 you enabled, so TRT decides which ops run in fp32 or fp16. The whole workflow should be something like: pytorch model -> original ONNX -> mix-precision TRT engine.
For some reason, TRT is going to deprecate weak typing, so for now I recommend you use strong typing, i.e., use modelopt.autocast to convert your ONNX. You can find useful info here: https://nvidia.github.io/TensorRT-Model-Optimizer/guides/8_autocast.html
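If you ever need to do the onnx -> engine step from Python instead of trtexec (e.g. on the deploy machine), the two modes look roughly like this. Treat it as a sketch: file names are placeholders and flag availability depends on your TRT version.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)

# Strongly typed network: TRT keeps the per-op precisions baked into the
# mixed-precision ONNX produced by modelopt.autocast (Python equivalent of
# `trtexec --stronglyTyped`). EXPLICIT_BATCH is required on TRT 9.x and is
# implied/deprecated on TRT 10.x.
flags = (1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) | (
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
)
network = builder.create_network(flags)
parser = trt.OnnxParser(network, logger)
with open("flow.decoder.estimator.mix_precision.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)
with open("flow.decoder.estimator.plan", "wb") as f:
    f.write(engine_bytes)

# Weak typing (what `trtexec --fp16` does) would instead parse the fp32 ONNX
# into a non-strongly-typed network and opt in to fp16:
#   network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
#   config.set_flag(trt.BuilderFlag.FP16)
```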
In Torch-TensorRT, we are considering adding the same mechanism as modelopt.autocast to bridge the intermediate steps, hopefully giving pytorch model -> mix-precision TRT engine directly.
Thank you very much for the help. We had experienced numerical overflow when using the pytorch -> onnx fp32 -> trt fp16 pipeline; the generated values are NaN. Yesterday I tried adding the modelopt.autocast step to the pipeline. The command is as follows:
python -m modelopt.onnx.autocast --onnx_path flow.decoder.estimator.fp32.onnx --output_path flow.decoder.estimator.mix_precision.onnx --providers cuda:0 --log_level DEBUG
The generated flow.decoder.estimator.mix_precision.onnx is used as a replacement for flow.decoder.estimator.fp32.onnx when running the trtexec conversion, but the resulting TRT engine still generates NaN values. Are there any suggestions?
If fp32 ONNX works for you, I think the easiest way is to 1) inspect which ops/nodes overflow in fp16, and 2) add --nodes_to_exclude or --op_types_to_exclude or other related args to prevent them from being converted to fp16. If you don't add these args, modelopt.autocast will convert almost all ops to fp16.
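For example, something like the sketch below, which just reruns your earlier command with an exclusion argument added. The excluded op types are placeholders, and the exact argument format may differ between ModelOpt versions, so check `python -m modelopt.onnx.autocast --help`:

```python
import subprocess

# Same modelopt.onnx.autocast call as before, but keeping overflow-prone op
# types in fp32. "Pow" / "Softmax" are placeholders; substitute the op types
# or node names you identified as overflowing in your own model.
subprocess.run(
    [
        "python", "-m", "modelopt.onnx.autocast",
        "--onnx_path", "flow.decoder.estimator.fp32.onnx",
        "--output_path", "flow.decoder.estimator.mix_precision.onnx",
        "--op_types_to_exclude", "Pow", "Softmax",
    ],
    check=True,
)
```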
@zewenli98 Hi zewen, a new situation to report.
As mentioned above, I had tested the pytorch -> onnx fp32 -> onnx mix_precision -> trt fp16 pipeline, and it generates NaN results. That is when we load the trt fp16 model with the INFO log level setting:
estimator_engine = trt.Runtime(trt.Logger(trt.Logger.INFO)).deserialize_cuda_engine(f.read())
But when I want to see the debug info and change the log level as follows:
estimator_engine = trt.Runtime(trt.Logger(trt.Logger.DEBUG)).deserialize_cuda_engine(f.read())
the inference result seems OK, with no numerical overflow. It is weird, right? I did not change anything else. Why would the engine log level affect the inference result? Is it a bug inside TensorRT 9.3?
@JohnHerry That's interesting. I don't expect logger level would affect outputs. Can you try the latest TRT 10.x?
@zewenli98 Sorry, our development environment is not convenient for installing and testing TRT 10.x.
As to the trt.Logger.DEBUG issue, I had guessed that it might be related to the python -m modelopt.onnx.autocast .... --log_level DEBUG command for ONNX autocast, but it happens again even after we dropped the --log_level option in the onnx-autocast step.
Also, we had tested two versions of this model, on the same environment (RTX 4090, CUDA 12.2):
- Version 1 follows the old pytorch -> onnx -> trt pipeline; here we get a TRT fp32 (full-precision) model.
- Version 2 follows the new pytorch -> onnx -> onnx mix_precision (with modelopt.autocast) -> trt fp16 pipeline; here we get a TRT engine file half the size of version 1.

But during inference, we found that the version-1 TRT fp32 engine and the version-2 TRT fp16 engine had nearly identical inference speed. That means the version-2 fp16 TRT engine did not save computation cost, yet according to the onnx.autocast logs 67% of the ops had been converted to fp16, and the model size was indeed halved. This is also a strange situation.
From the listed skipped nodes and converted nodes, it seems that modelopt.autocast skips most high-computation nodes like Div, Mul, Add, while the converted nodes are mostly low-computation ones like Constant, Unsqueeze, Shape. So what is the benefit of this autocast then, only memory?
> As to the trt.Logger.DEBUG issue, I had guessed that it might be related to the python -m modelopt.onnx.autocast .... --log_level DEBUG command for ONNX autocast, but it happens again even after we dropped the --log_level option in the onnx-autocast step.
Since it's kind of out of scope here, if you think this is a ModelOpt bug, please feel free to file a bug there.
Besides, I think the benefits of modelopt.autocast include both memory reduction and inference speedup. It "intelligently" selects nodes to keep in fp32, per the doc. If you can share your original fp32 ONNX, or at least a segment of it, I can take a look for you.
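In the meantime, if you want to double-check locally, you can count tensor dtypes and op types in the converted ONNX with a quick sketch like this (file name is a placeholder):

```python
from collections import Counter

import onnx

model = onnx.load("flow.decoder.estimator.mix_precision.onnx")

# Count weight/initializer dtypes: this shows how many of the *parameters*
# were converted, which explains the halved file size.
init_dtypes = Counter(
    onnx.TensorProto.DataType.Name(init.data_type)
    for init in model.graph.initializer
)
print("initializer dtypes:", init_dtypes)

# Count op types to see whether the heavy compute nodes (MatMul/Conv/etc.)
# dominate the graph, or whether most nodes are cheap shape/constant ops.
op_types = Counter(node.op_type for node in model.graph.node)
print("op types:", op_types.most_common(15))
```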