generative_inpainting
trying to explain deconvolution kernel size in contextual attention
I found some discussion about this in #73 and #22, and following the link provided by @JiahuiYu, I did some experiments and try to explain it here. This confused me for several days too, and I hope this helps.
First, I generated a fixed test image for the experiment. The first channel of this image is 4×6, with pixel values laid out as below, which makes verification easy:

```
 0  1  2  3  4  5
 6  7  8  9 10 11
...
```
```python
import numpy as np
import cv2

def gen_simple_image():
    height, width, channel = 4, 6, 3
    # channel 0 holds 0..23 in row-major order, channel 1 holds 24..47, etc.
    img = np.reshape(np.arange(height * width * channel), (channel, height, width))
    img = np.transpose(img, [1, 2, 0]).astype(np.uint8)  # to HWC for OpenCV
    cv2.imwrite('simple_image.bmp', img)
```
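As a quick sanity check (my own addition), BMP is lossless, so reading the image back should reproduce the grid exactly:

```python
import cv2
img = cv2.imread('simple_image.bmp')
print(img[:, :, 0])  # the 4x6 grid 0..23 shown above
```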
The two images below illustrate the one-dimensional case. We can see that in this situation the kernel size is always two times the stride, which explains the constant 2 in the code below.
https://github.com/JiahuiYu/generative_inpainting/blob/06cd62cfca8c10c349b451fa33d9cbb786bfaa20/inpaint_ops.py#L242
Moreover, we can see that two kernel placements cover each pixel, so we average them by dividing by 2. Our situation is two-dimensional, so we can imagine we should divide by 4, as in the code below (a runnable coverage check follows the link).
https://github.com/JiahuiYu/generative_inpainting/blob/06cd62cfca8c10c349b451fa33d9cbb786bfaa20/inpaint_ops.py#L307
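Here is a small runnable sketch (my own illustration, assuming TF 1.x as in the repo) that counts how many deconvolution kernel placements cover each output pixel in one dimension. With kernel = k * stride, every interior pixel is covered exactly k times, so the 2-D average divides by k²; boundary pixels are covered fewer times, which also explains the boundary mismatch we will see later:

```python
import numpy as np
import tensorflow as tf

def coverage_1d(width, stride, k):
    # transposed-convolve an all-ones kernel over an all-ones low-resolution
    # row; the result counts kernel placements per output pixel
    kernel = k * stride
    lowres = np.ones((1, 1, width // stride, 1), dtype=np.float32)
    ones_kernel = np.ones((1, kernel, 1, 1), dtype=np.float32)
    cover = tf.nn.conv2d_transpose(lowres, ones_kernel,
                                   output_shape=[1, 1, width, 1],
                                   strides=[1, 1, stride, 1], padding='SAME')
    with tf.Session() as sess:
        return sess.run(cover)[0, 0, :, 0]

print(coverage_1d(width=8, stride=2, k=2))  # [1. 2. 2. 2. 2. 2. 2. 1.]
```

Interior coverage is 2 in 1-D, hence 4 in 2-D, matching the division by 4 above.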
We can do an experiment to check whether the output y of the function contextual_attention is the same as its input f or b (where f and b are the same image). It should be very close, given the intention of contextual_attention: we borrow patches from b and paste them into y according to the similarity between b and f. (Some may wonder whether the output y will always equal b in the proposed model. Actually it won't if the mask is not all zero: the mask stops y from borrowing patches from the masked region of b, so y only borrows patches from outside the masked region. In our experiment the mask is all zero, so y should be very close to b.) The result is shown below: except for some pixels on the boundary, all other pixels are exactly the same.
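A sketch of this check (hedged: it assumes the signature of contextual_attention in inpaint_ops.py at the commit linked above, which returns y plus an attention-flow visualization, and TF 1.x; treat it as illustration, not repo code):

```python
import cv2
import numpy as np
import tensorflow as tf
from inpaint_ops import contextual_attention

img = cv2.imread('simple_image.bmp').astype(np.float32)[np.newaxis]  # 1x4x6x3
f = tf.constant(img)
b = tf.constant(img)  # f and b are the same image
# mask=None acts as an all-zero mask, so y may borrow patches from anywhere
# in b and should reconstruct b almost exactly (for such a tiny image you
# may need fuse=False)
y, flow = contextual_attention(f, b, mask=None, ksize=3, stride=1, rate=2)
with tf.Session() as sess:
    out = sess.run(y)
print(np.abs(out[0] - img[0]).max())  # near zero except at boundary pixels
```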
Should the convolution kernel size (w) be the same as the deconvolution kernel size (raw_w), as asked in #22? In my opinion, the convolution and its subsequent softmax only decide where each deconvolution kernel gets pasted, so the convolution kernel size does not matter much. We can change the parameter ksize from 3 to 5; y should be the same as above, and it actually is when I ran the experiment.
I wondered whether we can change the 2 and/or the 4 in the code above and still get the same effect. I found another pattern, shown below.
According to this pattern, I changed 2 to 1 (or 3) and 4 to 1 (or 9). Again I got y very close to the input b, except for some boundary pixels.
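Running the coverage sketch above with other patch multiples confirms this (outputs computed for width 6, stride 1):

```python
print(coverage_1d(width=6, stride=1, k=1))  # [1. 1. 1. 1. 1. 1.] -> divide by 1
print(coverage_1d(width=6, stride=1, k=3))  # [2. 3. 3. 3. 3. 2.] -> divide by 9 (interior)
```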
However, there is one more detail I did not mention above: while changing the 2 and the 4, we should keep the kernel size divisible by the stride. Namely, we should keep kernel divisible by rate * stride in the code below. As stated in [2], this ensures even overlap.
https://github.com/JiahuiYu/generative_inpainting/blob/06cd62cfca8c10c349b451fa33d9cbb786bfaa20/inpaint_ops.py#L243-L244
However, we can also allow uneven overlap if we average the deconvolution result with a corresponding uneven mask. The picture below shows such an uneven mask, and we can simulate this situation: change kernel to 4 and rate * stride to 3, and set the mask as shown below (a sketch deriving it follows the grid). After the deconvolution, yi should be divided elementwise by this mask. Again, the output y is very close to b. Note that the height and width of the inputs f and b must be multiples of rate * stride, which is 3 in this case, so the mask below is 3 × 6 instead of 4 × 6 as above.
```
1 1 1 2 1 1
1 1 1 2 1 1
1 1 1 2 1 1
```
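This mask can be derived with the same counting trick as before (my own sketch, TF 1.x assumed): with kernel 4 and stride 3 on a width-6 row, the two kernel placements overlap only at column 3:

```python
import numpy as np
import tensorflow as tf

lowres = np.ones((1, 1, 2, 1), dtype=np.float32)       # 6 / 3 = 2 placements
ones_kernel = np.ones((1, 4, 1, 1), dtype=np.float32)  # kernel = 4
mask = tf.nn.conv2d_transpose(lowres, ones_kernel, output_shape=[1, 1, 6, 1],
                              strides=[1, 1, 3, 1], padding='SAME')
with tf.Session() as sess:
    print(sess.run(mask)[0, 0, :, 0])  # [1. 1. 1. 2. 1. 1.]
```

Dividing yi elementwise by this mask (broadcast over rows) replaces the constant division by 4 when the overlap is uneven.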
Besides, in #22, @JiahuiYu said 'Basically larger patch mean more accurate of original resolution.'. In my opinion, the convolution result gives a rough, low-resolution location where the filters (raw_w) should be placed in the later deconvolution. A larger patch covers more pixels, so every pixel in y gets more values to average in the deconvolution, which may give a more accurate result. The image below shows larger patches.
To sum up the effects of the variables rate and stride in the function contextual_attention:

- rate: downsamples f, b, and mask
- stride: decreases the number of convolution/deconvolution filters
The discussion above is based on the view that the effect of deconvolution is to paste its kernels around the image, weighted by the convolution-and-softmax output yi. I understand this from the perspective that deconvolution is another kind of convolution, as stated in [1]. However, I can't explain it in more detail; maybe someone can do that for us.
I hope you guys understand what I am talking about ^_^.
References:
[1] V. Dumoulin and F. Visin, A Guide to Convolution Arithmetic for Deep Learning.
[2] A. Odena, V. Dumoulin, and C. Olah, Deconvolution and Checkerboard Artifacts, Distill, 2016.
@JiahuiYu I wonder if we really have to use deconvolution to copy patches of b into y. I wrote a toy function to do this:
```python
import numpy as np
import tensorflow as tf

def deconv():
    # a one-hot "attention" map with a single 1 at position (3, 3)
    img = np.zeros(shape=(1, 5, 5, 1))
    img[0, 3, 3, 0] = 1
    print('sigmoid output:\n', img[0, :, :, 0])
    filter = np.reshape(np.arange(9), (3, 3))
    print('filter:\n', filter)
    # flip the filter in both directions; convolving with the flipped filter
    # then pastes the original filter centered at the location of the 1
    filter = np.fliplr(np.flipud(filter)).astype(np.float64)
    filter = np.expand_dims(filter, 2)
    filter = np.expand_dims(filter, 3)
    output = tf.nn.conv2d(img, filter, strides=[1, 1, 1, 1], padding='SAME')
    tf.InteractiveSession()
    print('output:\n', output.eval()[0, :, :, 0])
```
```
sigmoid output:
 [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0.]]
filter:
 [[0 1 2]
 [3 4 5]
 [6 7 8]]
output:
 [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 2.]
 [0. 0. 3. 4. 5.]
 [0. 0. 6. 7. 8.]]
```
However, the toy function above uses some transpose and convolution operations. Maybe this is exactly deconvolution? And deconvolution has uneven overlaps because it has to keep the resolution, right? I will hold this in my head and figure it out someday.
Yes, the toy function above is deconvolution:
```python
import numpy as np
import tensorflow as tf

def check_deconv():
    image = np.reshape(np.arange(25.0), (5, 5))
    image = np.expand_dims(image, 0)
    image = np.expand_dims(image, 3)
    filter = np.reshape(np.arange(9.0), (3, 3))
    transposed_filter = np.fliplr(np.flipud(filter))
    filter = np.expand_dims(filter, 2)
    filter = np.expand_dims(filter, 3)
    transposed_filter = np.expand_dims(transposed_filter, 2)
    transposed_filter = np.expand_dims(transposed_filter, 3)
    conv = tf.nn.conv2d(image, filter, strides=[1, 1, 1, 1], padding='VALID')
    tf.InteractiveSession()
    # reference: TensorFlow's built-in transposed convolution
    transposed_conv = tf.nn.conv2d_transpose(conv, filter, output_shape=[1, 5, 5, 1],
                                             strides=[1, 1, 1, 1], padding='VALID')
    print('tensorflow conv2d_transpose:\n', transposed_conv.eval()[0, :, :, 0])
    # simulation: pad the conv output, then convolve with the flipped filter
    padded_conv_output = tf.pad(conv, [[0, 0], [2, 2], [2, 2], [0, 0]])
    output3 = tf.nn.conv2d(padded_conv_output, transposed_filter,
                           strides=[1, 1, 1, 1], padding='VALID')
    print('simulate deconv using transposed filter and conv:\n',
          output3.eval()[0, :, :, 0])
```
```
tensorflow conv2d_transpose:
 [[    0.   312.   972.  1080.   768.]
 [  936.  2784.  5616.  4896.  3048.]
 [ 3348.  8496. 15552. 12528.  7380.]
 [ 4968. 11424. 19440. 14688.  8232.]
 [ 4032.  8952. 14796. 10872.  5952.]]
simulate deconv using transposed filter and conv:
 [[    0.   312.   972.  1080.   768.]
 [  936.  2784.  5616.  4896.  3048.]
 [ 3348.  8496. 15552. 12528.  7380.]
 [ 4968. 11424. 19440. 14688.  8232.]
 [ 4032.  8952. 14796. 10872.  5952.]]
```
@12ycli Hi, thanks for your interest in our work first.
And thanks for your continuous efforts in explaining these things to others. I will keep this thread open so others can also have it as a reference. If you can slightly reorganize the contents of this thread to be even clearer, I would greatly appreciate it. Thanks and good job!