Two of the ForeignNodes consume 60% of inference time, among 1000 nodes

Open CanyonWind opened this issue 3 years ago • 3 comments

Description

When profiling the TensorRT engine with trtexec, we found that two of the ForeignNodes take 60% of the inference time. The TensorRT graph has about 1000 nodes in total, so this much latency from just two nodes doesn't make sense.

Could you please share some guidance on:

  • what is a ForeignNode?
  • any analysis on why it runs so slowly compared to the others? Are Myelin operators expected to run so slowly?
  • how could we identify the root cause and is there any suggested fix?

We really need some help on these. Thanks in advance!

, { "name" : "{ForeignNode[ReduceMean_4492...Mul_4557]}", "timeMs" : 3802.94, "averageMs" : 37.2837, "medianMs" : 37.2818, "percentage" : 30.5601 }
, { "name" : "{ForeignNode[ReduceMean_699...Mul_764]}", "timeMs" : 3802.82, "averageMs" : 37.2825, "medianMs" : 37.2818, "percentage" : 30.5591 }
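To make the share of these two layers explicit, here is a minimal sketch that ranks profile entries by their "percentage" field. It assumes the entries have been collected into a JSON array (here the two entries above are wrapped in one by hand); the helper name top_layers is hypothetical and not part of trtexec.

```python
import json

# The two entries quoted above, wrapped in a JSON array so they parse on their own.
profile_text = """
[
  { "name" : "{ForeignNode[ReduceMean_4492...Mul_4557]}", "timeMs" : 3802.94, "averageMs" : 37.2837, "medianMs" : 37.2818, "percentage" : 30.5601 },
  { "name" : "{ForeignNode[ReduceMean_699...Mul_764]}", "timeMs" : 3802.82, "averageMs" : 37.2825, "medianMs" : 37.2818, "percentage" : 30.5591 }
]
"""

def top_layers(entries, n=5):
    """Return the n entries with the largest share of total time.

    Entries lacking "name" or "percentage" (for example a header
    record, if the file has one) are skipped.
    """
    layers = [e for e in entries if "name" in e and "percentage" in e]
    return sorted(layers, key=lambda e: e["percentage"], reverse=True)[:n]

entries = json.loads(profile_text)
top = top_layers(entries)
combined = sum(e["percentage"] for e in top)

for e in top:
    print(f'{e["percentage"]:6.2f}%  {e["averageMs"]:8.3f} ms  {e["name"]}')
print(f"combined share: {combined:.2f}%")
```

Run on a full profile, the same function would list the heaviest layers first, which makes it easy to see whether a few fused nodes dominate.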

Environment

TensorRT Version: 8.4.3.1
NVIDIA GPU: A100
NVIDIA Driver Version: 470
CUDA Version: 11.4
CUDNN Version: 8.4
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8
Tensorflow Version (if applicable): NA
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Steps To Reproduce

We cannot share the model for reproduction, but it is a large diffusion model with a UNet structure and many attention layers.

CanyonWind avatar Sep 08 '22 00:09 CanyonWind

Screenshot for the profiling visualization (note those two purple ones) image

Percentage sunburst graph image

Both visualizations are drawn using the official tensorrt explorer

CanyonWind avatar Sep 08 '22 04:09 CanyonWind

  • what is a ForeignNode?

A ForeignNode is a group of operators handled by Myelin (a closed-source DL compiler) or by DLA. Usually it is NOT a single node but contains many nodes; for example, Myelin fuses lots of operators into one big operator (the ForeignNode) to improve performance. In your case, {ForeignNode[ReduceMean_4492...Mul_4557]} covers all nodes between ReduceMean_4492 and Mul_4557.
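As a small illustration of the naming scheme, the sketch below pulls the first and last fused node out of a ForeignNode layer name. The pattern is inferred only from the names that appear in this thread, so treat it as an assumption rather than a documented format; foreign_node_span is a hypothetical helper.

```python
import re

# Names in this thread look like {ForeignNode[first...last]}.
FOREIGN_NODE = re.compile(r"\{ForeignNode\[(?P<first>.+?)\.\.\.(?P<last>.+)\]\}")

def foreign_node_span(layer_name):
    """Return (first, last) node names of a ForeignNode, or None if the name does not match."""
    m = FOREIGN_NODE.fullmatch(layer_name)
    if m is None:
        return None
    return m.group("first"), m.group("last")

span = foreign_node_span("{ForeignNode[ReduceMean_4492...Mul_4557]}")
print(span)
```

With the two endpoints in hand, you can look up the same node names in the ONNX graph to see which subgraph was fused.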

  • any analysis on why it runs so slowly compared to the others? Are Myelin operators expected to run so slowly?

Based on the above, this is not a problem: the time reported for a ForeignNode is the combined time of all the operators fused into it.

  • how could we identify the root cause and is there any suggested fix?

Based on the above, this is not a problem.

zerollzeng avatar Sep 09 '22 07:09 zerollzeng

In my experience, {ForeignNode[ReduceMean_4492...Mul_4557]} is typically a layernorm operator fused by Myelin. You can check your torch/onnx graph and replace these nodes with a layernorm TensorRT plugin, which should help your model's performance.
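To show why a span running from a ReduceMean to a Mul looks like layernorm, here is a plain-Python sketch of layernorm written as a chain of elementwise and reduction steps. The op names in the comments are an assumption about how an exporter might decompose layernorm into ONNX primitives; the actual graph from this issue was not shared, and layernorm_decomposed is a hypothetical name.

```python
import math

def layernorm_decomposed(x, gamma, eps=1e-5):
    """Layer normalization over a 1-D list, one primitive op per step (no bias term)."""
    n = len(x)
    mean = sum(x) / n                                  # ReduceMean
    centered = [v - mean for v in x]                   # Sub
    squared = [c ** 2 for c in centered]               # Pow
    var = sum(squared) / n                             # ReduceMean
    std = math.sqrt(var + eps)                         # Add, Sqrt
    normalized = [c / std for c in centered]           # Div
    return [g * v for g, v in zip(gamma, normalized)]  # Mul

out = layernorm_decomposed([1.0, 2.0, 3.0, 4.0], gamma=[1.0] * 4)
print(out)
```

If your graph shows this chain between the two endpoint nodes, that is the subgraph a layernorm plugin would replace.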

shuo-ouyang avatar Sep 09 '22 08:09 shuo-ouyang

Closing since there has been no activity for more than 14 days. Please reopen if you still have questions. Thanks!

ttyio avatar Dec 12 '22 07:12 ttyio

Is it possible to disable Myelin? Is there documentation about which nodes are consumed by Myelin and which are not?

My log:

[12/30/2022-15:39:09] [V] [TRT] --------------- Timing Runner: {ForeignNode[transformer.h.0.attn.bias.../Cast]} (Myelin)
[12/30/2022-15:48:29] [V] [TRT] Tactic: 0x0000000000000000 Time: 17.8375
[12/30/2022-15:48:30] [V] [TRT] Fastest Tactic: 0x0000000000000000 Time: 17.8375
[12/30/2022-15:48:30] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: Myelin Tactic: 0x0000000000000000
[12/30/2022-15:48:30] [V] [TRT] Formats and tactics selection completed in 567.208 seconds.
...
[12/30/2022-15:48:47] [V] [TRT] Engine generation completed in 584.689 seconds.

Which likely corresponds to the following part of my graph:

image

I'll try to remove the Cast, but I think it would be great to have guidance in the docs on what is compiled as a Myelin ForeignNode and what is not!

For the model https://huggingface.co/anton-l/gpt-j-tiny-random, I have an issue at a different part of the model, still involving a Cast:

[12/30/2022-17:37:33] [V] [TRT] --------------- Timing Runner: {ForeignNode[transformer.h.0.ln_1.bias.../Cast]} (Myelin)
[12/30/2022-17:37:56] [V] [TRT] Tactic: 0x0000000000000000 Time: 1.23026
[12/30/2022-17:37:56] [V] [TRT] Fastest Tactic: 0x0000000000000000 Time: 1.23026
[12/30/2022-17:37:56] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: Myelin Tactic: 0x0000000000000000
[12/30/2022-17:37:56] [V] [TRT] Formats and tactics selection completed in 28.5682 seconds.
...
[12/30/2022-17:37:56] [V] [TRT] Engine generation completed in 29.8719 seconds.

image
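To quantify how much of the build time one runner consumed, the sketch below subtracts the timestamps of the "Timing Runner" line and the first "Tactic" line from the first log above. It assumes the bracketed prefix always has the month/day/year-hour:minute:second layout seen in these logs; timestamp is a hypothetical helper.

```python
from datetime import datetime

# Two lines copied from the first log above.
LOG = """\
[12/30/2022-15:39:09] [V] [TRT] --------------- Timing Runner: {ForeignNode[transformer.h.0.attn.bias.../Cast]} (Myelin)
[12/30/2022-15:48:29] [V] [TRT] Tactic: 0x0000000000000000 Time: 17.8375
"""

def timestamp(line):
    """Parse the leading "[MM/DD/YYYY-HH:MM:SS]" prefix of a log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%m/%d/%Y-%H:%M:%S")

lines = LOG.splitlines()
elapsed = (timestamp(lines[1]) - timestamp(lines[0])).total_seconds()
print(f"wall-clock time spent timing this runner: {elapsed:.0f} s")
```

For the first log this gap is 560 seconds, so nearly all of the 567.208 seconds of format and tactic selection went into timing this single Myelin ForeignNode.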

fxmarty avatar Dec 30 '22 16:12 fxmarty

Is it possible to disable Myelin? Is there documentation about which nodes are consumed by Myelin and which are not?

There is no knob to disable Myelin. Usually boolean ops, loop ops, and transform blocks go to Myelin.

ttyio avatar Jan 13 '23 05:01 ttyio