spconv
The first spconv node shows abnormal latency when I try to deploy with TensorRT.
In the screenshot, the latency of the first spconv node is almost three times that of the other spconv nodes, and there is a significant gap between the average and median latency of spconv.
Has anyone met this issue and solved it?
Maybe it's the size of the input? What is the size of your first conv's input?
@CaptainRui1000 how did you get the TensorRT plugin to work? I wrote a symbolic op and a C++ plugin for SparseImplicitGemmFunction, then exported my network with spconv.SubMConv2d to ONNX. Did you do the same? I tested the TensorRT engine with the same input data (the input to SubMConv2d) I used for export, but the input to SparseImplicitGemmFunction in Python and to the SparseImplicitGemmFunction plugin in C++ differ significantly...
The valid size of the test input data I used is about 42k. After further development and more trials, I think this issue may be caused by the statistical strategy of trtexec.
Your method sounds similar to mine. If the difficulty is with the input, maybe you can try tv::from_blob to wrap a raw pointer as a tensor.
After setting --dumpProfile and --separateProfileRun at the same time, or just dropping the --dumpProfile flag to get an end-to-end latency, the average and median latency of spconv are almost consistent. After setting --useSpinWait, the average and median latency of the first node become different, but they are still close in total. So I think this issue may be caused by the statistical strategy of trtexec.
Now I can run inference on an entire VoxelNeXt with TensorRT, and the latency of each layer measured by trtexec seems normal. The total latency of this network on x86 is about 10 ms, and trtexec and my actual program give similar latencies. But when I attempt to deploy on Orin, trtexec gives 18 ms while my actual deployment project gives 32 ms. I have set Orin's nvpmodel to MAXN, and I used jtop to watch the GPU load: during trtexec inference the GPU load stabilizes at about 80%, while in my project the GPU load fluctuates. This is the problem I currently need to solve.
What's your GPU on the x86 side? A 3090? The latency of my int8 engine on Orin is about 4.5 times the latency on a 3090; I'm not sure if this is normal.
The GPU I used is a laptop 3060; the latency of my fp16 engine on that GPU is about 10 ms as measured by both trtexec and my time log. When I used the same fp16 ONNX on Orin, trtexec showed 18 ms but my time log showed 32 ms, which I think is abnormal, right? The latencies on Orin are not consistent.
I'm not sure; I tried on DRIVE Orin, not Jetson Orin. By time log, do you mean dumpProfile? I run inference using the C++ API and put a timer on the stream, and the inference time is consistent with what I get from trtexec.
I use chrono to make the time log. So should I use cudaEvent instead?
From what I have experienced, chrono should be accurate enough, though maybe I'm wrong. I think timing the gemm and indice-pair kernels and doing statistics on those would make more sense.
Hi, would you share how to convert the backbone of VoxelNeXt? Thanks.