
GPU soft-NMS is slower because NumPy is called in the middle of Torch ops


PyTorch executes GPU ops asynchronously, so when you call into NumPy it has to synchronize all pending ops and copy the data to the CPU, then copy it back when you call a Torch op again, which is extremely slow. https://github.com/DocF/Soft-NMS/blob/95dab79eac5c786f61fef2f6d5cd633eec7ecfd6/softnms_pytorch.py#L51
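To illustrate the pattern being described (this is a sketch, not the repo's actual code at that line, and the Gaussian re-weighting step and function names are just for illustration): a NumPy round-trip in the middle of the loop forces a device-to-host sync on every call, while the equivalent Torch expression keeps the work queued on the GPU.

```python
import numpy as np
import torch

def slow_step(scores: torch.Tensor, ious: torch.Tensor, sigma: float = 0.5):
    # Forces a sync + device-to-host copy, then a host-to-device copy again.
    weights = np.exp(-(ious.cpu().numpy() ** 2) / sigma)
    return scores * torch.from_numpy(weights).to(scores.device)

def fast_step(scores: torch.Tensor, ious: torch.Tensor, sigma: float = 0.5):
    # Same re-weighting expressed entirely in torch ops; no host round-trip.
    return scores * torch.exp(-(ious ** 2) / sigma)
```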

zylo117 · Jul 24 '20 01:07

@zylo117 Hi, I implemented a pure PyTorch version of the soft-NMS function; it's currently in this Google Colab notebook: https://colab.research.google.com/drive/1gzhXX-LyMdZ41qHv0rKzHpxKLYliEiPn?usp=sharing

What confuses me is that the GPU takes even longer to run the speed() function, as shown below and in the Colab notebook (a rough sketch of such a loop follows the timings). Do you know what is going on?

PyTorch 1.5.1+cu101 CPU
Pure PyTorch, average run time: 24.799434 ms
With NumPy, average run time: 24.725247 ms

PyTorch 1.5.1+cu101 _CudaDeviceProperties(name='Tesla P100-PCIE-16GB', major=6, minor=0, total_memory=16280MB, multi_processor_count=56)
Pure PyTorch, average run time: 67.407458 ms
With NumPy, average run time: 80.926901 ms
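Since the notebook itself isn't reproduced here, this is a minimal sketch of what a pure-PyTorch Gaussian soft-NMS loop typically looks like (the names `soft_nms`, `sigma`, and `score_thresh` are mine, and the notebook's code may differ). The data dependence between iterations, plus the implicit host sync on `argmax().item()`, is the kind of thing that keeps a GPU version slower than the CPU one:

```python
import torch

def soft_nms(boxes: torch.Tensor, scores: torch.Tensor,
             sigma: float = 0.5, score_thresh: float = 0.001) -> torch.Tensor:
    """boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns indices of kept boxes."""
    N = boxes.size(0)
    indexes = torch.arange(N, device=boxes.device)
    boxes, scores = boxes.clone(), scores.clone()
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    for i in range(N - 1):
        # Pick the highest-scoring remaining box; argmax().item() forces a
        # host sync every iteration, one reason the GPU loop is slow.
        max_pos = int(scores[i:].argmax().item()) + i
        if max_pos != i:
            for t in (boxes, scores, areas, indexes):
                t[[i, max_pos]] = t[[max_pos, i]]

        # IoU of the selected box with every box after it (all torch ops, no numpy).
        xx1 = torch.max(boxes[i, 0], boxes[i + 1:, 0])
        yy1 = torch.max(boxes[i, 1], boxes[i + 1:, 1])
        xx2 = torch.min(boxes[i, 2], boxes[i + 1:, 2])
        yy2 = torch.min(boxes[i, 3], boxes[i + 1:, 3])
        inter = (xx2 - xx1).clamp(min=0) * (yy2 - yy1).clamp(min=0)
        iou = inter / (areas[i] + areas[i + 1:] - inter + 1e-9)

        # Gaussian decay of the overlapping scores (soft-NMS re-weighting).
        scores[i + 1:] *= torch.exp(-(iou ** 2) / sigma)

    return indexes[scores > score_thresh]
```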

JinhangZhu · Jul 26 '20 18:07

In that case, I think it's inevitable that the GPU version is slower than the CPU one, because the current algorithm runs sequentially: each iteration depends on the results of the previous one. But anyway, I find that soft-NMS is not efficient at all, because it applies confidence thresholding after NMS, which increases the NMS processing time a lot. Confidence thresholding + vanilla NMS takes less than 1 ms, but soft-NMS now takes 24 ms. I'm afraid it might not be worth it.
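For reference, the "confidence thresholding + vanilla NMS" path mentioned above can be expressed with torchvision's built-in NMS op, which runs as a single batched kernel rather than a Python loop. This is a rough sketch with illustrative thresholds, not the code from any particular repo:

```python
import torch
from torchvision.ops import nms

def threshold_then_nms(boxes: torch.Tensor, scores: torch.Tensor,
                       score_thresh: float = 0.05, iou_thresh: float = 0.5) -> torch.Tensor:
    # Drop low-confidence boxes first, then run a single (CPU or CUDA) NMS kernel.
    kept_idx = torch.where(scores > score_thresh)[0]
    keep = nms(boxes[kept_idx], scores[kept_idx], iou_thresh)
    return kept_idx[keep]  # indices into the original boxes/scores
```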

zylo117 · Jul 27 '20 03:07

@zylo117 Thank you for your explanation. I think I will try to improve the algorithm in the future.

JinhangZhu · Jul 27 '20 11:07

Thanks for the code! Using PyTorch ops may not be efficient for a GPU version of soft-NMS. It would probably need a custom CUDA implementation where the work is partitioned into grids (one per class, for N classes) and threads (one per loop), with the loop running for 50-100 iterations. With the right implementation, soft-NMS should run within 1 ms on a GPU. I'll try to push this version in a few weeks.

bharatsingh430 · Jun 12 '22 23:06