
The first spconv node gives an abnormal latency when I try to deploy on TensorRT

Open CaptainRui1000 opened this issue 1 year ago • 11 comments

[screenshot]

In the screenshot, the latency of the first spconv node is almost three times that of the other spconv nodes.

There is also a significant gap between the avg and median latency of spconv.

Has anyone met this issue and solved it?

CaptainRui1000 avatar Oct 30 '23 03:10 CaptainRui1000

Maybe it's the size of the input? What is the size of your first conv's input?

superpigforever avatar Nov 02 '23 06:11 superpigforever

@CaptainRui1000 how did you get the TensorRT plugin to work? I wrote a symbolic op and a C++ plugin for SparseImplicitGemmFunction, then exported my NN with spconv.SubMConv2d to ONNX. Did you do the same? I tested the TensorRT engine with the same input data (the input to SubMConv2d) I used for export, but it seems that the input to SparseImplicitGemmFunction in Python and to the SparseImplicitGemmFunction plugin in C++ differ significantly...

ArseniuML avatar Nov 15 '23 08:11 ArseniuML

> Maybe it's the size of input? What is the size of your first conv's input?

The valid size of the test input data I used is about 42k. After further development and more trials, I think this issue may be caused by the statistical strategy of trtexec.

CaptainRui1000 avatar Nov 28 '23 06:11 CaptainRui1000

> @CaptainRui1000 how did you get TensorRT plugin work? I wrote symbolic op and C++ plugin for SparseImplicitGemmFunction […] input to SparseImplicitGemmFunction in Python and SparseImplicitGemmFunction plugin in C++ differs significantly...

Your method sounds similar to mine. If the difficulty you can't handle is about the input, maybe you can try tv::from_blob to wrap a raw pointer as a tensor.

CaptainRui1000 avatar Nov 28 '23 06:11 CaptainRui1000

After setting --dumpProfile and --separateProfileRun at the same time, or just dropping the --dumpProfile flag to get an end-to-end latency, the avg and median latency of spconv are almost consistent. After setting --useSpinWait, the avg and median latencies of the first node become different, but the totals are still close. So I think this issue may be caused by the statistical strategy of trtexec.
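For reference, the two profiling modes described above correspond roughly to these trtexec invocations (the engine path is a placeholder):

```shell
# Per-layer profile collected in a separate pass, so profiling overhead
# does not distort the end-to-end numbers:
trtexec --loadEngine=model.engine --dumpProfile --separateProfileRun

# End-to-end latency only; --useSpinWait busy-waits on synchronization
# for more stable host-side timestamps:
trtexec --loadEngine=model.engine --useSpinWait
```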

Now I can run inference on an entire VoxelNeXt using TensorRT, and the latency of each layer measured by trtexec seems normal. The total latency of this network on x86 is about 10 ms, and trtexec and the actual program give similar latencies. But when I attempt to deploy on Orin, trtexec gives 18 ms while my actual deployment project gives 32 ms. I have set the nvpmodel of Orin to MAXN, and I used jtop to view the GPU load: during trtexec inference the GPU load stabilizes at around 80%, while in my project the GPU load fluctuates. This is the problem I currently need to solve.

CaptainRui1000 avatar Nov 28 '23 07:11 CaptainRui1000

> after setting --dumpProfile & --separateProfileRun at the same time […] so I think this issue maybe caused by the statistical strategy of trtexec.

> now, i can infer an entire voxelnext using tensor-rt […] this is my current problem need to solve

What's your GPU on the x86 side? A 3090? The latency of my int8 engine on Orin is about 4.5 times the latency on the 3090; not sure if this is normal.

superpigforever avatar Nov 28 '23 09:11 superpigforever

> after setting --dumpProfile & --separateProfileRun at the same time […] this is my current problem need to solve

> What's your gpu with x86 arch? 3090? […] not sure if this is normal

The GPU I used is a 3060 laptop; the latency of my fp16 engine on this GPU is about 10 ms, measured by both trtexec and my time log. When I used the same fp16 ONNX on Orin, trtexec showed 18 ms and the time log showed 32 ms, which I think is abnormal, right? The latencies on Orin are not consistent.

CaptainRui1000 avatar Nov 29 '23 02:11 CaptainRui1000

> after setting --dumpProfile & --separateProfileRun at the same time […] this is my current problem need to solve

> What's your gpu with x86 arch? 3090? […] not sure if this is normal

> the gpu I used is a 3060 laptop […] the latencys on orin are not consistent.

I'm not sure; I tried on Drive Orin, but not Jetson Orin. By time log, do you mean --dumpProfile? I tried to run inference using the C++ API with a timer on the stream, and the inference time is consistent with what I get from trtexec.

superpigforever avatar Nov 29 '23 09:11 superpigforever

> after setting --dumpProfile & --separateProfileRun at the same time […] this is my current problem need to solve

> What's your gpu with x86 arch? 3090? […] not sure if this is normal

> the gpu I used is a 3060 laptop […] the latencys on orin are not consistent.

> I'm not sure, I tried on drive orin, but not jetson orin. […] the infer time is consistent with what I get with trtexec

I use chrono to make the time log. So should I use cudaEvent?

CaptainRui1000 avatar Dec 01 '23 06:12 CaptainRui1000

> after setting --dumpProfile & --separateProfileRun at the same time […] this is my current problem need to solve

> What's your gpu with x86 arch? 3090? […] not sure if this is normal

> the gpu I used is a 3060 laptop […] the latencys on orin are not consistent.

> I'm not sure, I tried on drive orin, but not jetson orin. […] the infer time is consistent with what I get with trtexec

> I use chrono to make time log. so I should use cudaEvent ?

From what I've experienced, chrono should be accurate enough, but maybe I'm wrong. I think timing the gemm and indice-pair kernels and doing stats on those would make more sense.

superpigforever avatar Dec 01 '23 07:12 superpigforever

> after setting --dumpProfile & --separateProfileRun at the same time […] so I think this issue maybe caused by the statistical strategy of trtexec.

> now, i can infer an entire voxelnext using tensor-rt […] this is my current problem need to solve

Hi, would you share how to convert the backbone of VoxelNeXt? Thanks

cugsgl avatar Dec 28 '23 07:12 cugsgl