Faster implementation of `torchvision.ops.boxes::masks_to_boxes`
### 🚀 The feature
`torchvision.ops.boxes::masks_to_boxes` converts a batch of binary 2D image masks into a set of bounding boxes. This proposal is to create a faster and more general version of this function.
### Motivation, pitch
The implementation of `masks_to_boxes` uses a Python for loop over the batch dimension to compute the bounding box for each mask individually, giving the function $O(B)$ complexity and rendering it unsuitable for large batch sizes. Instead, we can recreate the function with intelligent use of the built-in vectorized operations to achieve the same result at close to $O(1)$ complexity.
Some preliminary performance benchmarking seems to support this hypothesis.
The proposed version is also more general, as it can be extended to 3D masks easily.
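The vectorized approach can be sketched roughly as follows (a simplified illustration under my own naming, not the actual PR code; `masks_to_boxes_vectorized` is a hypothetical name). It follows torchvision's `(xmin, ymin, xmax, ymax)` output convention:

```python
import torch

def masks_to_boxes_vectorized(masks: torch.Tensor) -> torch.Tensor:
    """Compute (xmin, ymin, xmax, ymax) boxes for a (B, H, W) batch of binary masks."""
    n, h, w = masks.shape
    # Collapse each mask onto its rows and columns in one shot.
    rows = masks.any(dim=2)  # (B, H): which rows contain mask pixels
    cols = masks.any(dim=1)  # (B, W): which columns contain mask pixels
    ys = torch.arange(h, device=masks.device).expand(n, h)
    xs = torch.arange(w, device=masks.device).expand(n, w)
    # Masked min/max: replace empty positions with sentinels before reducing.
    y_min = torch.where(rows, ys, torch.full_like(ys, h)).min(dim=1).values
    y_max = torch.where(rows, ys, torch.full_like(ys, -1)).max(dim=1).values
    x_min = torch.where(cols, xs, torch.full_like(xs, w)).min(dim=1).values
    x_max = torch.where(cols, xs, torch.full_like(xs, -1)).max(dim=1).values
    return torch.stack([x_min, y_min, x_max, y_max], dim=1).to(torch.float32)
```

Extending this to 3D masks only requires a third `any` reduction over the depth axis, which is where the generality comes from.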
### Alternatives
N/A
### Additional context
I've already implemented this feature and have a pull request ready. However, I'd first like to know whether the core contributors consider this a worthwhile feature.
It is hard to assess without seeing the code, but unless you found a magic bullet, this should increase the memory requirement from O(1) to O(B), no? That becomes very relevant for larger batch sizes.
Feel free to send the PR, especially since it is already done. But we need to benchmark it in terms of more than just execution time to assess whether we want the change.
Thanks for opening the issue @atharvas . As @pmeier suggested, we'd be happy to look at a PR.
Did you find that masks_to_boxes was a bottleneck for you? Is it in the hot-path of operations?
Hello! Thanks for the input! I've submitted a PR here.
@NicolasHug: I'm using `masks_to_boxes` to crop an image based on the output of a mask prediction model as one step in my pipeline. All of this runs on a GPU. My code was scaling poorly to large batch sizes, and the torch profiler attributed the bottleneck to `torchvision.ops.boxes::masks_to_boxes`. This motivated the pull request.
@pmeier: I stated incorrectly in the issue that the time complexity is $O(1)$. Apologies. The time and space complexity are still $O(B)$, but with a much smaller constant because the operations are vectorized. I've added the memory and time graphs to the PR here to show how both methods scale. Overall, the vectorized version seems to scale better than the naive version in both speed and storage.
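For reference, this kind of comparison could be run with a small harness along these lines (a sketch; `masks_to_boxes_loop` is my own baseline mirroring the spirit of the current per-mask loop, and the vectorized candidate would be passed to `bench` the same way):

```python
import time

import torch

def masks_to_boxes_loop(masks: torch.Tensor) -> torch.Tensor:
    # Naive O(B) Python loop, similar in spirit to the current implementation.
    boxes = torch.zeros((masks.shape[0], 4), dtype=torch.float32)
    for i, mask in enumerate(masks):
        ys, xs = torch.where(mask)
        boxes[i] = torch.stack([xs.min(), ys.min(), xs.max(), ys.max()])
    return boxes

def bench(fn, masks: torch.Tensor, iters: int = 20) -> float:
    """Average seconds per call; synchronizes around the loop on GPU."""
    fn(masks)  # warm-up
    if masks.is_cuda:
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(masks)
    if masks.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters
```

On CUDA, peak memory for each variant can be compared by calling `torch.cuda.reset_peak_memory_stats()` before the function and `torch.cuda.max_memory_allocated()` after it.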
Thanks for the suggestion and happy to iterate on this.