AdaptiveAveragePooling: Number of ways to split should evenly divide the split dimension
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
- TensorFlow version and how it was installed (source or binary): Google Colab
- TensorFlow-Addons version and how it was installed (source or binary): 0.12.0
- Python version: 3
- Is GPU used? (yes/no): no
Describe the bug
AdaptiveAveragePooling2D fails when the input spatial dimension is not evenly divisible by the requested output size. In contrast, the PyTorch equivalent (AdaptiveAvgPool2d) works fine in the same case.
Code to reproduce the issue
import tensorflow as tf
import tensorflow_addons as tfa
layer = tfa.layers.AdaptiveAveragePooling2D(3)
batch = tf.random.uniform((2, 16, 16, 5))
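# The spatial dimensions (16) are not evenly divisible by the output size (3), so the call below fails.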
layer(batch)
Other info / logs
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-6-e9a53d37389e> in <module>()
1 layer = tfa.layers.AdaptiveAveragePooling2D(3)
2 batch = tf.random.uniform((2, 16, 16, 5))
----> 3 layer(batch)
6 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
InvalidArgumentError: Number of ways to split should evenly divide the split dimension, but got split_dim 1 (size = 16) and num_split 3 [Op:Split] name: split
/cc @Susmit-A for visibility. It seems that PyTorch uses overlapping kernels when the dimension is not divisible by the output size. I wonder if this is achievable with Python ops.
https://colab.research.google.com/drive/1KMSD4nMv_-P59TsbSxQksndBZ6oNyyri?usp=sharing
This was done on purpose, to prevent unexpected behaviour for different input sizes and to offer flexibility in handling shapes.
The inputs can be clipped/padded as required before being passed to the layer.
Since different models perform differently depending on whether data is clipped or padded, it's better to let the user decide the appropriate action.
The PyTorch example uses overlapping windows (the values 2 and 4 are each accessed twice). I haven't yet found a way to do this in TensorFlow without an additional loop, which negatively impacts performance.
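For illustration, here is a small standalone sketch (not the tfa code; it just applies the floor/ceil index rule described in the links below) showing the windows PyTorch would use for a length-5 input pooled to 3 outputs; the middle window overlaps both neighbours, so the second and fourth values are each read twice.

import math

def adaptive_windows(input_size, output_size):
    # Start/end indices of each output bin, following the commonly cited rule:
    # start = floor(i * in / out), end = ceil((i + 1) * in / out).
    return [
        (math.floor(i * input_size / output_size),
         math.ceil((i + 1) * input_size / output_size))
        for i in range(output_size)
    ]

print(adaptive_windows(5, 3))  # [(0, 2), (1, 4), (3, 5)] -> overlap at indices 1 and 3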
Hi @Susmit-A, I'd say we should try to support the non-divisible case. If not, is there any difference between adaptive and ordinary pooling? We can always infer the strides and kernel sizes of tf.nn.{avg,max}_pool to produce a specific output size. Maybe
stride = floor(input_size / output_size)
kernel_size = input_size − (output_size − 1) * stride
I guess.
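As a rough sketch of that idea (the helper name here is illustrative, not an existing tfa API), the kernel and stride can be inferred and passed to tf.nn.avg_pool2d:

import tensorflow as tf

def adaptive_avg_pool2d_approx(inputs, output_size):
    # Infer a fixed stride/kernel so avg_pool produces `output_size` bins per
    # spatial dimension (non-overlapping, so not exact in the non-divisible case).
    in_h, in_w = inputs.shape[1], inputs.shape[2]
    stride_h, stride_w = in_h // output_size, in_w // output_size
    kernel_h = in_h - (output_size - 1) * stride_h
    kernel_w = in_w - (output_size - 1) * stride_w
    return tf.nn.avg_pool2d(
        inputs,
        ksize=[1, kernel_h, kernel_w, 1],
        strides=[1, stride_h, stride_w, 1],
        padding="VALID",
    )

batch = tf.random.uniform((2, 16, 16, 5))
print(adaptive_avg_pool2d_approx(batch, 3).shape)  # (2, 3, 3, 5)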
https://stackoverflow.com/questions/53841509/how-does-adaptive-pooling-in-pytorch-work
The way PyTorch accomplishes this is well explained here:
https://discuss.pytorch.org/t/what-is-adaptiveavgpool2d/26897
The issue is the rounding of the window boundaries rather than a uniform floor/ceil: it produces unequal-size pieces that can't be sliced and stacked for parallel pooling, so each piece has to be sliced and processed individually.
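A minimal sketch of that per-piece approach for a single spatial axis (a hypothetical helper, not the tfa implementation; a 2D version would apply the same slicing to both axes), which reproduces PyTorch's values at the cost of a Python loop:

import math
import tensorflow as tf

def adaptive_avg_pool_rows(inputs, output_size):
    # inputs: (batch, height, width, channels); pools only the height axis
    # by slicing each unequal-size piece individually and averaging it.
    in_size = inputs.shape[1]
    pieces = []
    for i in range(output_size):
        start = math.floor(i * in_size / output_size)
        end = math.ceil((i + 1) * in_size / output_size)
        pieces.append(tf.reduce_mean(inputs[:, start:end, :, :], axis=1))
    return tf.stack(pieces, axis=1)

batch = tf.random.uniform((2, 16, 16, 5))
print(adaptive_avg_pool_rows(batch, 3).shape)  # (2, 3, 16, 5)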
The method mentioned by @WindQAQ does work, but the output is not the same as PyTorch's, since the kernel covers one value fewer or one value more (in 1D arrays; for 2D images, it's one pixel in each dimension) than PyTorch's kernel does. If that is not an issue, I can implement it in that manner. We could also offer an option to compute the pooled values iteratively to obtain exactly the values that PyTorch produces, if the user requires it.
For the divisible case, rounding up or down does not matter, which means the current adaptive pooling is just a variant of tf.nn.{avg,max}_pool. We should run some benchmarks to see whether that would be faster. We don't actually have to follow the PyTorch implementation, but I believe our implementation so far, in the divisible case, matches PyTorch.
https://colab.research.google.com/drive/1GwoL4Sv2lRd_wp5eCtQogDnBQv7KounS?usp=sharing
For the non-divisible case, I agree that it would be much slower if we try to implement it with pure Python ops. Let's see how big the performance degradation is first, and improve it later.
I have created a PR with the updated version of Adaptive Pooling. However, this is an order of magnitude slower than the PyTorch version. Methods to improve performance need to be explored. Specifically, the internal loop needs to be parallelized.