generative_inpainting
trying to explain deconvolution kernel size in contextual attention
I found some discussion about this in #73 and #22, and following the link provided by @JiahuiYu, I did some experiments and try to explain it here. This confused me for several days too, and I hope this helps.
First, I generated a fixed test image for the experiment. The first channel of this image is 4×6, with pixel values laid out as below, which makes verification easy:

```
 0  1  2  3  4  5
 6  7  8  9 10 11
...
```
```python
import numpy as np
import cv2

def gen_simple_image():
    height, width, channel = 4, 6, 3
    # channel 0 holds 0..23 in row-major order, channel 1 holds 24..47, etc.
    img = np.reshape(np.arange(height * width * channel), (channel, height, width))
    img = np.transpose(img, [1, 2, 0]).astype(np.uint8)  # to HWC for OpenCV
    cv2.imwrite('simple_image.bmp', img)
```
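As a quick sanity check (my own addition), BMP is lossless, so reading the image back should reproduce the grid exactly:

```python
import cv2
img = cv2.imread('simple_image.bmp')
print(img[:, :, 0])  # the 4x6 grid 0..23 shown above
```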
The two images below illustrate the one-dimensional case. We can see that in this situation the kernel size is always two times the stride, which explains the constant 2 in the code below.
https://github.com/JiahuiYu/generative_inpainting/blob/06cd62cfca8c10c349b451fa33d9cbb786bfaa20/inpaint_ops.py#L242
Moreover, we can see that two kernel placements cover each pixel, so we average them by dividing by 2. Our situation is two-dimensional, so we can imagine we should divide by 4, as in the code below (a runnable coverage check follows the link).
https://github.com/JiahuiYu/generative_inpainting/blob/06cd62cfca8c10c349b451fa33d9cbb786bfaa20/inpaint_ops.py#L307
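Here is a small runnable sketch (my own illustration, assuming TF 1.x as in the repo) that counts how many deconvolution kernel placements cover each output pixel in one dimension. With kernel = k * stride, every interior pixel is covered exactly k times, so the 2-D average divides by k²; boundary pixels are covered fewer times, which also explains the boundary mismatch we will see later:

```python
import numpy as np
import tensorflow as tf

def coverage_1d(width, stride, k):
    # transposed-convolve an all-ones kernel over an all-ones low-resolution
    # row; the result counts kernel placements per output pixel
    kernel = k * stride
    lowres = np.ones((1, 1, width // stride, 1), dtype=np.float32)
    ones_kernel = np.ones((1, kernel, 1, 1), dtype=np.float32)
    cover = tf.nn.conv2d_transpose(lowres, ones_kernel,
                                   output_shape=[1, 1, width, 1],
                                   strides=[1, 1, stride, 1], padding='SAME')
    with tf.Session() as sess:
        return sess.run(cover)[0, 0, :, 0]

print(coverage_1d(width=8, stride=2, k=2))  # [1. 2. 2. 2. 2. 2. 2. 1.]
```

Interior coverage is 2 in 1-D, hence 4 in 2-D, matching the division by 4 above.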
We can do an experiment to check whether the output y of the function contextual_attention is the same as its input f or b (where f and b are the same image). It should be very close, given the intention of contextual_attention: we borrow patches from b and paste them into y according to the similarity between b and f. (Some may wonder whether the output y will always equal b in the proposed model. Actually it won't if the mask is not all zero: the mask stops y from borrowing patches from the masked region of b, so y only borrows patches from outside the masked region. In our experiment the mask is all zero, so y should be very close to b.) The result is shown below: except for some pixels on the boundary, all other pixels are exactly the same.
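A sketch of this check (hedged: it assumes the signature of contextual_attention in inpaint_ops.py at the commit linked above, which returns y plus an attention-flow visualization, and TF 1.x; treat it as illustration, not repo code):

```python
import cv2
import numpy as np
import tensorflow as tf
from inpaint_ops import contextual_attention

img = cv2.imread('simple_image.bmp').astype(np.float32)[np.newaxis]  # 1x4x6x3
f = tf.constant(img)
b = tf.constant(img)  # f and b are the same image
# mask=None acts as an all-zero mask, so y may borrow patches from anywhere
# in b and should reconstruct b almost exactly (for such a tiny image you
# may need fuse=False)
y, flow = contextual_attention(f, b, mask=None, ksize=3, stride=1, rate=2)
with tf.Session() as sess:
    out = sess.run(y)
print(np.abs(out[0] - img[0]).max())  # near zero except at boundary pixels
```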
Should the convolution kernel size (w) be the same as the deconvolution kernel size (raw_w), as asked in #22? In my opinion, the convolution and its subsequent softmax only decide where each deconvolution kernel gets pasted, so the convolution kernel size does not matter much. We can change the parameter ksize from 3 to 5; y should be the same as above, and it actually is when I ran the experiment.
I wondered whether we can change the 2 and/or the 4 in the code above and still get the same effect. I found another pattern, shown below.
According to this pattern, I changed 2 to 1 (or 3) and 4 to 1 (or 9). Again I got y very close to the input b, except for some boundary pixels.
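Running the coverage sketch above with other patch multiples confirms this (outputs computed for width 6, stride 1):

```python
print(coverage_1d(width=6, stride=1, k=1))  # [1. 1. 1. 1. 1. 1.] -> divide by 1
print(coverage_1d(width=6, stride=1, k=3))  # [2. 3. 3. 3. 3. 2.] -> divide by 9 (interior)
```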
However, there is one more detail I did not mention above: while changing the 2 and the 4, we should keep the kernel size divisible by the stride. Namely, we should keep kernel divisible by rate * stride in the code below. As stated in [2], this ensures even overlap.
https://github.com/JiahuiYu/generative_inpainting/blob/06cd62cfca8c10c349b451fa33d9cbb786bfaa20/inpaint_ops.py#L243-L244
However, we can also allow uneven overlap if we average the deconvolution result with a corresponding uneven mask. The picture below shows such an uneven mask, and we can simulate this situation: change kernel to 4 and rate * stride to 3, and set the mask as shown below (a sketch deriving it follows the grid). After the deconvolution, yi should be divided elementwise by this mask. Again, the output y is very close to b. Note that the height and width of the inputs f and b must be multiples of rate * stride, which is 3 in this case, so the mask below is 3 × 6 instead of 4 × 6 as above.
```
1 1 1 2 1 1
1 1 1 2 1 1
1 1 1 2 1 1
```
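This mask can be derived with the same counting trick as before (my own sketch, TF 1.x assumed): with kernel 4 and stride 3 on a width-6 row, the two kernel placements overlap only at column 3:

```python
import numpy as np
import tensorflow as tf

lowres = np.ones((1, 1, 2, 1), dtype=np.float32)       # 6 / 3 = 2 placements
ones_kernel = np.ones((1, 4, 1, 1), dtype=np.float32)  # kernel = 4
mask = tf.nn.conv2d_transpose(lowres, ones_kernel, output_shape=[1, 1, 6, 1],
                              strides=[1, 1, 3, 1], padding='SAME')
with tf.Session() as sess:
    print(sess.run(mask)[0, 0, :, 0])  # [1. 1. 1. 2. 1. 1.]
```

Dividing yi elementwise by this mask (broadcast over rows) replaces the constant division by 4 when the overlap is uneven.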
Besides, in #22, @JiahuiYu said 'Basically larger patch mean more accurate of original resolution.'. In my opinion, the convolution result gives a rough, low-resolution location where the filters (raw_w) should be placed in the later deconvolution. A larger patch covers more pixels, so every pixel in y gets more values to average in the deconvolution, which may give a more accurate result. The image below shows larger patches.
To sum up the effects of the variables rate and stride in the function contextual_attention:

- rate: downsamples f, b, and mask
- stride: decreases the number of convolution/deconvolution filters
The discussion above is based on the view that the effect of deconvolution is to paste its kernels around the image, weighted by the convolution-and-softmax output yi. I understand this from the perspective that deconvolution is another kind of convolution, as stated in [1]. However, I can't explain it in more detail; maybe someone can do that for us.
I hope you guys understand what I am talking about ^_^.
References:
[1] V. Dumoulin and F. Visin, A Guide to Convolution Arithmetic for Deep Learning.
[2] A. Odena, V. Dumoulin, and C. Olah, Deconvolution and Checkerboard Artifacts, Distill, 2016.
@JiahuiYu I wonder if we really have to use deconvolution to copy patches of b into y. I wrote a toy function to do this:
```python
import numpy as np
import tensorflow as tf

def deconv():
    # a one-hot "attention" map with a single 1 at position (3, 3)
    img = np.zeros(shape=(1, 5, 5, 1))
    img[0, 3, 3, 0] = 1
    print('sigmoid output:\n', img[0, :, :, 0])
    filter = np.reshape(np.arange(9), (3, 3))
    print('filter:\n', filter)
    # flip the filter in both directions; convolving with the flipped filter
    # then pastes the original filter centered at the location of the 1
    filter = np.fliplr(np.flipud(filter)).astype(np.float64)
    filter = np.expand_dims(filter, 2)
    filter = np.expand_dims(filter, 3)
    output = tf.nn.conv2d(img, filter, strides=[1, 1, 1, 1], padding='SAME')
    tf.InteractiveSession()
    print('output:\n', output.eval()[0, :, :, 0])
```
```
sigmoid output:
 [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0.]]
filter:
 [[0 1 2]
 [3 4 5]
 [6 7 8]]
output:
 [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 2.]
 [0. 0. 3. 4. 5.]
 [0. 0. 6. 7. 8.]]
```
However, the toy function above uses some transpose and convolution operations. Maybe this is exactly deconvolution? And deconvolution has uneven overlaps because it has to keep the resolution, right? I will hold this in my head and figure it out someday.
Yes, the toy function above is deconvolution:
```python
import numpy as np
import tensorflow as tf

def check_deconv():
    image = np.reshape(np.arange(25.0), (5, 5))
    image = np.expand_dims(image, 0)
    image = np.expand_dims(image, 3)
    filter = np.reshape(np.arange(9.0), (3, 3))
    transposed_filter = np.fliplr(np.flipud(filter))
    filter = np.expand_dims(filter, 2)
    filter = np.expand_dims(filter, 3)
    transposed_filter = np.expand_dims(transposed_filter, 2)
    transposed_filter = np.expand_dims(transposed_filter, 3)
    conv = tf.nn.conv2d(image, filter, strides=[1, 1, 1, 1], padding='VALID')
    tf.InteractiveSession()
    # reference: TensorFlow's built-in transposed convolution
    transposed_conv = tf.nn.conv2d_transpose(conv, filter, output_shape=[1, 5, 5, 1],
                                             strides=[1, 1, 1, 1], padding='VALID')
    print('tensorflow conv2d_transpose:\n', transposed_conv.eval()[0, :, :, 0])
    # simulation: pad the conv output, then convolve with the flipped filter
    padded_conv_output = tf.pad(conv, [[0, 0], [2, 2], [2, 2], [0, 0]])
    output3 = tf.nn.conv2d(padded_conv_output, transposed_filter,
                           strides=[1, 1, 1, 1], padding='VALID')
    print('simulate deconv using transposed filter and conv:\n',
          output3.eval()[0, :, :, 0])
```
```
tensorflow conv2d_transpose:
 [[    0.   312.   972.  1080.   768.]
 [  936.  2784.  5616.  4896.  3048.]
 [ 3348.  8496. 15552. 12528.  7380.]
 [ 4968. 11424. 19440. 14688.  8232.]
 [ 4032.  8952. 14796. 10872.  5952.]]
simulate deconv using transposed filter and conv:
 [[    0.   312.   972.  1080.   768.]
 [  936.  2784.  5616.  4896.  3048.]
 [ 3348.  8496. 15552. 12528.  7380.]
 [ 4968. 11424. 19440. 14688.  8232.]
 [ 4032.  8952. 14796. 10872.  5952.]]
```
@12ycli Hi, thanks for your interest in our work first.
And thanks for your continuous efforts in explaining these things to others. I will keep this thread open so others can also have it as a reference. If you can slightly reorganize the contents of this thread to be even clearer, I would greatly appreciate it. Thanks and good job!