Non-Local_Nets-Tensorflow

I think the code is wrong

Open ZhiboRao opened this issue 7 years ago • 8 comments

I compared the code in this repository with Facebook's version, and I find that this version is not the non-local net published at CVPR / on arXiv. This is Facebook's source code:

def spacetime_nonlocal(
        model, blob_in, dim_in, dim_out, batch_size, prefix, dim_inner,
        is_test, max_pool_stride=2):
    # ---------------------
    cur = blob_in
    # we do projection to convert each spacetime location to a feature
    # theta original size
    # e.g., (8, 1024, 4, 14, 14) => (8, 1024, 4, 14, 14)

    theta = model.ConvNd(
        cur, prefix + '_theta',
        dim_in,
        dim_inner,
        [1, 1, 1],
        strides=[1, 1, 1],
        pads=[0, 0, 0] * 2,
        weight_init=('GaussianFill', {'std': cfg.NONLOCAL.CONV_INIT_STD}),
        bias_init=('ConstantFill', {'value': 0.}), no_bias=cfg.NONLOCAL.NO_BIAS)

    # phi and g: half spatial size
    # e.g., (8, 1024, 4, 14, 14) => (8, 1024, 4, 7, 7)
    if cfg.NONLOCAL.USE_MAXPOOL is True:
        max_pool = model.MaxPool(
            cur, prefix + '_pool',
            kernels=[1, max_pool_stride, max_pool_stride],
            strides=[1, max_pool_stride, max_pool_stride],
            pads=[0, 0, 0] * 2,
        )
    else:
        max_pool = cur

    phi = model.ConvNd(
        max_pool, prefix + '_phi',
        dim_in,
        dim_inner,
        [1, 1, 1],
        strides=[1, 1, 1],
        pads=[0, 0, 0] * 2,
        weight_init=('GaussianFill', {'std': cfg.NONLOCAL.CONV_INIT_STD}),
        bias_init=('ConstantFill', {'value': 0.}), no_bias=cfg.NONLOCAL.NO_BIAS)

    g = model.ConvNd(
        max_pool, prefix + '_g',
        dim_in,
        dim_inner,
        [1, 1, 1],
        strides=[1, 1, 1],
        pads=[0, 0, 0] * 2,
        weight_init=('GaussianFill', {'std': cfg.NONLOCAL.CONV_INIT_STD}),
        bias_init=('ConstantFill', {'value': 0.}), no_bias=cfg.NONLOCAL.NO_BIAS)

    # we have to use explicit batch size (to support arbitrary spacetime size)
    # e.g., (8, 1024, 4, 14, 14) => (8, 1024, 784)
    theta, theta_shape_5d = model.Reshape(
        theta, [theta + '_re' if not cfg.MODEL.ALLOW_INPLACE_RESHAPE else theta,
            theta + '_shape5d'],
        shape=(batch_size, dim_inner, -1))
    phi, phi_shape_5d = model.Reshape(
        phi, [phi + '_re' if not cfg.MODEL.ALLOW_INPLACE_RESHAPE else phi,
            phi + '_shape5d'],
        shape=(batch_size, dim_inner, -1))
    g, g_shape_5d = model.Reshape(
        g, [g + '_re' if not cfg.MODEL.ALLOW_INPLACE_RESHAPE else g,
            g + '_shape5d'],
        shape=(batch_size, dim_inner, -1))

    # e.g., (8, 1024, 784) * (8, 1024, 784) => (8, 784, 784)
    theta_phi = model.net.BatchMatMul([theta, phi], prefix + '_affinity', trans_a=1)
    if cfg.NONLOCAL.USE_SOFTMAX is True:
        if cfg.NONLOCAL.USE_SCALE is True:
            theta_phi_sc = model.Scale(theta_phi, theta_phi, scale=dim_inner**-.5)
        else:
            theta_phi_sc = theta_phi
        # softmax
        # sum(p[i, j, :]) == 1, for any i, j
        p = model.Softmax(theta_phi_sc, theta_phi + '_prob', engine='CUDNN', axis=2)
    else:
        ones = model.net.ConstantFill([theta_phi], [theta_phi + '_ones'], value=1.)
        ones = model.net.ReduceBackSum([ones], [theta_phi + '_const'])

        zeros = model.net.ConstantFill([theta_phi], [theta_phi + '_zeros'], value=0.)
        denom = model.net.Add(
            [zeros, ones], [theta_phi + '_denom'], broadcast=1, axis=0)

        model.StopGradient(denom, denom)
        p = model.net.Div([theta_phi, denom], [theta_phi + '_sc'])

    # note: g's axis[2] corresponds to p's axis[2]
    # e.g., g(8, 1024, 784_2) * p(8, 784_1, 784_2) => (8, 1024, 784_1)
    t = model.net.BatchMatMul([g, p], prefix + '_y', trans_b=1)

    # reshape back:
    # e.g., (8, 1024, 784) => (8, 1024, 4, 14, 14)
    t_re, t_shape = model.Reshape(
        [t, theta_shape_5d],
        [t + '_re' if not cfg.MODEL.ALLOW_INPLACE_RESHAPE else t,
            t + '_shape3d'])
    blob_out = t_re

    blob_out = model.ConvNd(
        blob_out, prefix + '_out',
        dim_inner,
        dim_out,
        [1, 1, 1],
        strides=[1, 1, 1],
        pads=[0, 0, 0] * 2,
        weight_init=('GaussianFill', {'std': cfg.NONLOCAL.CONV_INIT_STD})
        if not cfg.NONLOCAL.USE_ZERO_INIT_CONV else
        ('ConstantFill', {'value': 0.}),
        bias_init=('ConstantFill', {'value': 0.}), no_bias=cfg.NONLOCAL.NO_BIAS)

    if cfg.NONLOCAL.USE_BN is True:
        blob_out = model.SpatialBN(
            blob_out, prefix + "_bn", dim_out,
            epsilon=cfg.NONLOCAL.BN_EPSILON, momentum=cfg.NONLOCAL.BN_MOMENTUM,
            is_test=is_test
        )
        model.param_init_net.ConstantFill(
            [prefix + "_bn_s"], prefix + "_bn_s", value=cfg.NONLOCAL.BN_INIT_GAMMA)

    if cfg.NONLOCAL.USE_AFFINE is True:
        blob_out = model.AffineNd(blob_out, prefix + "_bn", dim_out)

    return blob_out

In fact, it uses MatMul instead of a conv op for the pairwise affinity.
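To make the comparison concrete, here is a minimal sketch (illustrative shapes only, not code from either repository) of the affinity step that spacetime_nonlocal performs with BatchMatMul, written with plain TensorFlow ops:

import tensorflow as tf

# Illustrative shapes: theta/phi/g stand in for the 1x1x1 conv outputs
# already flattened to (batch, dim_inner, T*H*W), as in the comments above.
batch, dim_inner, thw = 8, 512, 4 * 14 * 14
theta = tf.random.normal([batch, dim_inner, thw])
phi = tf.random.normal([batch, dim_inner, thw])
g = tf.random.normal([batch, dim_inner, thw])

# affinity: (B, THW, C) x (B, C, THW) -> (B, THW, THW), i.e. BatchMatMul with trans_a=1
affinity = tf.matmul(theta, phi, transpose_a=True)
# scaled softmax over the last axis, as in the USE_SCALE / USE_SOFTMAX branch
p = tf.nn.softmax(affinity * dim_inner ** -0.5, axis=2)

# aggregate g with the attention weights: (B, C, THW) x (B, THW, THW)^T -> (B, C, THW)
y = tf.matmul(g, p, transpose_b=True)
print(y.shape)  # (8, 512, 784)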

ZhiboRao avatar Aug 27 '18 00:08 ZhiboRao

In fact, a conv op is a matmul mathematically. In the file Non-Local_Nets-Tensorflow/ops.py there is an implementation of NonLocalBlock that uses conv and matmul ops too; it seems to be the embedded Gaussian version:

def NonLocalBlock(input_x, out_channels, sub_sample=True, is_bn=True, scope='NonLocalBlock'):
    batchsize, height, width, in_channels = input_x.get_shape().as_list()
    with tf.variable_scope(scope) as sc:
        with tf.variable_scope('g') as scope:
            g = slim.conv2d(input_x, out_channels, [1,1], stride=1, scope='g')
            if sub_sample:
                g = slim.max_pool2d(g, [2,2], stride=2, scope='g_max_pool')

        with tf.variable_scope('phi') as scope:
            phi = slim.conv2d(input_x, out_channels, [1,1], stride=1, scope='phi')
            if sub_sample:
                phi = slim.max_pool2d(phi, [2,2], stride=2, scope='phi_max_pool')

        with tf.variable_scope('theta') as scope:
            theta = slim.conv2d(input_x, out_channels, [1,1], stride=1, scope='theta')

        g_x = tf.reshape(g, [batchsize, out_channels, -1])
        g_x = tf.transpose(g_x, [0, 2, 1])

        theta_x = tf.reshape(theta, [batchsize, out_channels, -1])
        theta_x = tf.transpose(theta_x, [0, 2, 1])
        phi_x = tf.reshape(phi, [batchsize, out_channels, -1])

        f = tf.matmul(theta_x, phi_x)
        # ???
        f_softmax = tf.nn.softmax(f, -1)
        y = tf.matmul(f_softmax, g_x)
        y = tf.reshape(y, [batchsize, height, width, out_channels])
        with tf.variable_scope('w') as scope:
            w_y = slim.conv2d(y, in_channels, [1,1], stride=1, scope='w')
            if is_bn:
                w_y = slim.batch_norm(w_y)
        z = input_x + w_y
        return z
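As a quick check of the "a 1x1 conv is a matmul" point, here is a small sketch (made-up shapes, assuming TF2 eager execution, not code from the repository) showing that a 1x1 convolution over an NHWC tensor matches flattening the spatial positions and multiplying by the kernel matrix:

import numpy as np
import tensorflow as tf

x = tf.random.normal([2, 14, 14, 64])   # NHWC input
w = tf.random.normal([1, 1, 64, 32])    # 1x1 kernel: 64 -> 32 channels

conv_out = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')

# flatten spatial positions and apply the same weights as a plain matmul
flat = tf.reshape(x, [-1, 64])                                   # (B*H*W, C_in)
matmul_out = tf.reshape(tf.matmul(flat, tf.reshape(w, [64, 32])), [2, 14, 14, 32])

print(np.allclose(conv_out.numpy(), matmul_out.numpy(), atol=1e-4))  # True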

675492062 avatar Oct 09 '18 16:10 675492062

@675492062 I wonder why this implementation uses g_x = tf.reshape(g, [batchsize, out_channels, -1]) followed by g_x = tf.transpose(g_x, [0, 2, 1]) after computing g_x or theta_x. In my opinion, the output of slim.conv2d should be [batch_size, height, width, out_channels], so reshaping it to [batchsize, out_channels, -1] and then transposing may mess up the dimensions of the matrix. Why not just reshape to [batchsize, -1, out_channels]?
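For what it's worth, a small check (made-up tensor, assuming TF2 eager execution) of this point: with an NHWC tensor, reshaping to [batchsize, out_channels, -1] and then transposing does not give the same layout as reshaping directly to [batchsize, -1, out_channels]:

import numpy as np
import tensorflow as tf

x = tf.random.normal([1, 4, 4, 8])   # NHWC, as slim.conv2d returns

a = tf.transpose(tf.reshape(x, [1, 8, -1]), [0, 2, 1])   # the implementation's order
b = tf.reshape(x, [1, -1, 8])                            # reshape straight to (B, HW, C)

# Both are (1, 16, 8), but only b keeps each row as the channel vector of one
# spatial location; the reshape in a mixes values from different locations.
print(np.allclose(a.numpy(), b.numpy()))  # False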

YongyiTang92 avatar Oct 30 '18 07:10 YongyiTang92

@YongyiTang92 Yes, I think so. Code on the GitHub website is not guaranteed to be correct. Ha-ha!

675492062 avatar Oct 31 '18 09:10 675492062

@YongyiTang92 @675492062 Thanks for the reminder. I will check the code and revise it if necessary.

nnUyi avatar Dec 17 '18 02:12 nnUyi

@RaoHaocheng Actually, the matmul formulation is just another version of the non-local block.

nnUyi avatar Dec 17 '18 02:12 nnUyi

@nnUyi Well, have you trained it on other datasets?

leesky1c avatar Jan 05 '19 11:01 leesky1c

@675492062 I wonder why this implementation uses g_x = tf.reshape(g, [batchsize, out_channels, -1]) followed by g_x = tf.transpose(g_x, [0, 2, 1]) after computing g_x or theta_x. In my opinion, the output of slim.conv2d should be [batch_size, height, width, out_channels], so reshaping it to [batchsize, out_channels, -1] and then transposing may mess up the dimensions of the matrix. Why not just reshape to [batchsize, -1, out_channels]?

I have tried changing the reshape operation; however, whether I change it or not, the non-local model does not show better performance. I also tried deleting the lines "nonlocal_block1 = NonLocalBlock(cnv1_pool, 32, scope='nonlocal_block1')" and "nonlocal_block2 = NonLocalBlock(cnv2_pool, 64, scope='nonlocal_block2')" to disable the non-local blocks. To my surprise, the network without the non-local blocks yields better results. Have you experienced this? Thanks~

hnyz979 avatar Aug 19 '20 07:08 hnyz979

@675492062 I wonder why this implementation uses g_x = tf.reshape(g, [batchsize, out_channels, -1]) followed by g_x = tf.transpose(g_x, [0, 2, 1]) after computing g_x or theta_x. In my opinion, the output of slim.conv2d should be [batch_size, height, width, out_channels], so reshaping it to [batchsize, out_channels, -1] and then transposing may mess up the dimensions of the matrix. Why not just reshape to [batchsize, -1, out_channels]?

I have tried changing the reshape operation; however, whether I change it or not, the non-local model does not show better performance. I also tried deleting the lines "nonlocal_block1 = NonLocalBlock(cnv1_pool, 32, scope='nonlocal_block1')" and "nonlocal_block2 = NonLocalBlock(cnv2_pool, 64, scope='nonlocal_block2')" to disable the non-local blocks. To my surprise, the network without the non-local blocks yields better results. Have you experienced this? Thanks~

Just experiment more across different tasks, and of course make sure your ideas and code are correct. In my experiments, some tasks improved a little, but the gains were very small, and other tasks did not improve at all.

675492062 avatar Aug 19 '20 14:08 675492062