Triton-TensorRT-Inference-CRAFT-pytorch

speedup

Open ltm920716 opened this issue 3 years ago • 10 comments

Hello, I found that there is no speedup using TensorRT (FP32, FP16) inference. Is that right?

I also found that batch inference with the torch model gives no speedup. I do not know if I am doing something wrong.

ltm920716 avatar Aug 23 '21 01:08 ltm920716

Hi @ltm920716, yes, TensorRT (RT) does speed up inference, because it optimizes the model for the specific GPU it is built (and then run) on. By batch inference for the torch model, do you mean traditional .pth inference? If so, my repo does not enhance that. If you use RT in Triton, batch inference can be improved further by enabling the dynamic batcher in the Triton server (https://github.com/triton-inference-server/server/blob/main/docs/architecture.md#dynamic-batcher). FYI, TensorRT and Triton can bring further performance gains as we apply more optimizations on them:

  • See the "working on" items: https://docs.nvidia.com/deeplearning/tensorrt/index.html

k9ele7en avatar Aug 23 '21 01:08 k9ele7en
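
For context on the dynamic batcher mentioned above: it groups concurrent requests on the server side, so the client just sends individual requests. A minimal client-side sketch with tritonclient is below; the model and tensor names ("craft", "input", "output") and the server URL are assumptions, so check the actual config.pbtxt in the Model Repository.

```python
# Hypothetical client-side sketch: the Triton dynamic batcher (enabled in the
# model's config.pbtxt) groups these single-image requests on the server.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

img = np.random.rand(1, 3, 1280, 736).astype(np.float32)   # one preprocessed image
infer_input = httpclient.InferInput("input", list(img.shape), "FP32")
infer_input.set_data_from_numpy(img)

result = client.infer(
    model_name="craft",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output")],
)
score_map = result.as_numpy("output")
print(score_map.shape)
```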

@k9ele7en, I tested the traditional .pth model on a Tesla V100 (32 GB) and found no speedup. So I think maybe the network is too large to benefit from batch inference; I am confused.

ltm920716 avatar Aug 23 '21 02:08 ltm920716

@ltm920716, did you test the .pth locally or the .pt (TorchScript) on the Triton server (placed in the Model Repository)?

k9ele7en avatar Aug 23 '21 02:08 k9ele7en

@k9ele7en I tested the original torch model craft_mlt_25k.pth on torch==1.7.0 with batching, and there is no speedup. Then I tested craft_mlt_25k.trt (torch-onnx-trt), and there is no speedup for FP32 either. I measured only the model inference time.

With TensorRT, FP32 gives no speedup, FP16 is faster, and FP16 is the same speed as INT8.

ltm920716 avatar Aug 23 '21 02:08 ltm920716

Here are the results from the original test.py in the CRAFT repo:

torch.Size([1, 3, 1280, 736]), forward times: 0.0510 s, 0.0500 s, 0.0509 s, 0.0508 s, 0.0511 s

torch.Size([8, 3, 1280, 736]), forward times: 0.3932 s, 0.3979 s, 0.3971 s, 0.3940 s, 0.3954 s

So I am confused: batch 8 takes roughly eight times as long as batch 1, i.e. there is no per-image speedup.

ltm920716 avatar Aug 23 '21 03:08 ltm920716
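
One thing worth ruling out when timing the .pth model: CUDA kernel launches are asynchronous, so a wall-clock timer only measures the real GPU work if you synchronize first. A minimal timing sketch, assuming `net` is the loaded CRAFT model already on the GPU:

```python
# Minimal GPU timing sketch: warm up, disable autograd, and synchronize so the
# timer covers the actual forward pass rather than just the kernel launch.
import time
import torch

@torch.no_grad()
def bench(net, batch_size, iters=5):
    x = torch.randn(batch_size, 3, 1280, 736, device="cuda")
    for _ in range(3):                 # warm-up iterations
        net(x)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        t0 = time.time()
        net(x)
        torch.cuda.synchronize()       # wait for the GPU to finish
        times.append(time.time() - t0)
    return times
```

If per-image time barely improves from batch 1 to batch 8 even with correct timing, the GPU is most likely already saturated at batch size 1 for a 1280x736 input, which would explain the near-linear scaling above.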

@ltm920716, yes, the model needs to be large enough for the RT engine to make a difference in performance (time). I am not sure CRAFT is big enough; this repo is just an example for a large-scale solution. TensorRT and Triton are often combined in big inference deployments, such as in medical or manufacturing businesses...

k9ele7en avatar Aug 23 '21 03:08 k9ele7en

@ltm920716, I have not benchmarked or experimented with batching on RT yet. Currently in the config I fix the batch (dynamic input) size to 1 for all three values (min, max, opt); you can try again by setting a larger max batch size before exporting to ONNX and then to RT (https://github.com/k9ele7en/Triton-TensorRT-Inference-CRAFT-pytorch/blob/c6824359593500c61c51b84de54935468da595a0/converters/config.py#L42).

Please refer to this: https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#batching. By the way, why don't you pass x with batch size 3 to the .pth model and then compare against the RT engine?

k9ele7en avatar Aug 23 '21 03:08 k9ele7en
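
To make the min/opt/max point above concrete, here is a rough sketch (not the repo's exact converter) of exporting the model with a dynamic batch axis and then giving TensorRT an optimization profile whose min/opt/max batch sizes differ instead of all being fixed to 1. The tensor names, the (1, 4, 8) batch sizes, and the single-output simplification are assumptions:

```python
# Rough conversion sketch, assuming `net` is the loaded CRAFT model on GPU and
# simplifying it to a single output tensor for brevity.
import torch
import tensorrt as trt

# 1) Export ONNX with a dynamic batch dimension.
dummy = torch.randn(1, 3, 1280, 736, device="cuda")
torch.onnx.export(
    net, dummy, "craft_dynamic.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=11,
)

# 2) Build a TensorRT engine that accepts batch sizes 1..8.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("craft_dynamic.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# min, opt, max input shapes: batch sizes 1, 4, 8 at fixed 1280x736 resolution
profile.set_shape("input",
                  (1, 3, 1280, 736),
                  (4, 3, 1280, 736),
                  (8, 3, 1280, 736))
config.add_optimization_profile(profile)
config.set_flag(trt.BuilderFlag.FP16)           # optional FP16 engine

engine = builder.build_engine(network, config)  # older TRT API; newer versions use build_serialized_network
```

With a profile like this, the same engine can be benchmarked at batch 1 and batch 8 against the batched .pth model, as suggested above.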

https://github.com/k9ele7en/Triton-TensorRT-Inference-CRAFT-pytorch/issues/1#issuecomment-903414622 @k9ele7en thanks! The test above is only with the original torch model, and I found that batch inference brings no improvement. I will compare it with RT next. Thanks again.

ltm920716 avatar Aug 23 '21 03:08 ltm920716

@ltm920716 no problem, it would be great if you share the results (both good and bad) of your experiments so that we can discuss them and people can find useful information and avoid mistakes in the future...

k9ele7en avatar Aug 23 '21 03:08 k9ele7en

OK, I will try.

ltm920716 avatar Aug 23 '21 04:08 ltm920716