Language-Modeling-GatedCNN

While training, the model can see all words (besides the last one)

Open talbaumel opened this issue 8 years ago • 11 comments

Let's say a sentence in the data set is (1, 2, 3, 4). Then the prepare_data function will create: X = (1, 2, 3), Y = (2, 3, 4).

While predicting 2 and 3, your model can simply copy them from the input.
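For concreteness, a minimal sketch of that shift (my own illustration, not the repo's actual prepare_data):

    # Hypothetical sketch of the X/Y shift described above.
    sentence = [1, 2, 3, 4]
    X = sentence[:-1]   # (1, 2, 3)  -- inputs
    Y = sentence[1:]    # (2, 3, 4)  -- targets, so Y[t] == X[t+1]
    # If the network at position t can look at X[t+1], it can copy the
    # target instead of learning to predict it.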

talbaumel avatar Jan 15 '17 10:01 talbaumel

When convolving the inputs, the zero-padding added to the top rows of the input layer makes sure that a hidden state does not contain information from future words.
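To illustrate the idea (a plain-NumPy sketch with made-up numbers, not the repo's code): padding k-1 zeros on top of the time axis means a height-k filter ending at position t covers only positions t-k+1 .. t.

    import numpy as np

    k = 3                                   # filter height along the time axis
    x = np.array([10, 20, 30, 40, 50])      # toy embedded sequence, one channel

    x_padded = np.concatenate([np.zeros(k - 1, dtype=x.dtype), x])   # k-1 zeros on top
    windows = [x_padded[t:t + k] for t in range(len(x))]
    # windows[t] covers original positions t-k+1 .. t only,
    # so the hidden state at t never sees words after position t.
    print(windows[1])                       # [ 0 10 20] -- no access to 30, 40, 50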

anantzoid avatar Jan 16 '17 03:01 anantzoid

I feel like zero padding should be used in every convolution layer, like this: https://github.com/openai/pixel-cnn/blob/master/pixel_cnn_pp/nn.py#L296.

ruotianluo avatar Jan 17 '17 20:01 ruotianluo

@ruotianluo Zero padding is used in every layer to keep the layer size the same: https://github.com/anantzoid/Language-Modeling-GatedCNN/blob/master/model.py#L62 The zero padding I referred to in the comment above is the extra padding required to prevent the filter from seeing future words.
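A rough sketch of the two kinds of padding being distinguished here (TF 1.x style, made-up shapes, not the repo's exact code):

    import tensorflow as tf   # TF 1.x, as used in this repo

    k = 5                                                   # conf.filter_h (time-axis filter height)
    embed = tf.placeholder(tf.float32, [64, 20, 128, 1])    # [batch, time, emb, channels], made-up sizes
    w = tf.random_normal([k, 1, 1, 32])                     # dummy filter weights

    # (1) SAME padding inside each conv only keeps the output the same length;
    #     by itself it centres the filter window around position t.
    conv_same = tf.nn.conv2d(embed, w, strides=[1, 1, 1, 1], padding='SAME')

    # (2) The extra padding on the input: k-1 zeros at the top of the time axis,
    #     so a filter ending at position t sees no words after t.
    shifted = tf.pad(embed, [[0, 0], [k - 1, 0], [0, 0], [0, 0]])
    conv_causal = tf.nn.conv2d(shifted, w, strides=[1, 1, 1, 1], padding='VALID')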

anantzoid avatar Jan 20 '17 21:01 anantzoid

Can padding only mask_layer[:,0:conf.filter_h/2,:] = 0 prevent the filter from seeing future words? Why not (conf.filter_h-1)?

wangwang110 avatar Feb 15 '17 14:02 wangwang110

Can padding at only the first layer prevent the filter from seeing future words? Sorry, I can't understand it; can you explain it in detail? Thank you very much.

wangwang110 avatar Feb 16 '17 05:02 wangwang110

Yes, I have the same concern here. I output some trace messages:

    xbatch[0] = [[  1   1   3  13 123   5  12 152   7  84 129  21 106  48   5  14  89  30   6 140   6]
                 [ 57  88   5  25  60  23   2   4   1   1   3  13  51  10  22 136  68  28 105   6  52]
                 [104 121  11  54  10 134  10 138  22  64 151  47 133  69   2   4   1   1   3  13  97]]

    ybatch[0] = [[  1   3  13 123   5  12 152   7  84 129  21 106  48   5  14  89  30   6 140   6 118]
                 [ 88   5  25  60  23   2   4   1   1   3  13  51  10  22 136  68  28 105   6  52  90]
                 [121  11  54  10 134  10 138  22  64 151  47 133  69   2   4   1   1   3  13  97  46]]

always comes out to 1. I have changed the batch size to 3 so it's easier to look at. Everything else is default.

I looked at the model code, and it's basically trying to map (for example)

    [  1   1   3  13 123   5  12 152   7  84 129  21 106  48   5  14  89  30   6 140   6]

to

    [  1   3  13 123   5  12 152   7  84 129  21 106  48   5  14  89  30   6 140   6 118]

Here is the gist of the model:

    tf.reset_default_graph()

    # Inputs: each row of X is a context of length context_size-1, and y is the
    # same row shifted one position to the left (the next-word targets).
    self.X = tf.placeholder(shape=[conf.batch_size, conf.context_size-1], dtype=tf.int32, name="X")
    self.y = tf.placeholder(shape=[conf.batch_size, conf.context_size-1], dtype=tf.int32, name="y")

    embed = self.create_embeddings(self.X, conf)
    h, res_input = embed, embed

    # Stack of gated convolutional layers with residual connections.
    for i in range(conf.num_layers):
        fanin_depth = h.get_shape()[-1]
        filter_size = conf.filter_size if i < conf.num_layers-1 else 1
        shape = (conf.filter_h, conf.filter_w, fanin_depth, filter_size)

        with tf.variable_scope("layer_%d"%i):
            conv_w = self.conv_op(h, shape, "linear")
            conv_v = self.conv_op(h, shape, "gated")
            h = conv_w * tf.sigmoid(conv_v)   # gated linear unit
            if i % conf.block_size == 0:
                h += res_input                # residual connection every block_size layers
                res_input = h

    # Flatten (batch, time) into rows so each position gets its own NCE target.
    h = tf.reshape(h, (-1, conf.embedding_size))
    y_shape = self.y.get_shape().as_list()
    self.y = tf.reshape(self.y, (y_shape[0] * y_shape[1], 1))

    softmax_w = tf.get_variable("softmax_w", [conf.vocab_size, conf.embedding_size], tf.float32,
                                tf.random_normal_initializer(0.0, 0.1))
    softmax_b = tf.get_variable("softmax_b", [conf.vocab_size], tf.float32, tf.constant_initializer(1.0))

    # Preference: NCE loss, hierarchical softmax, adaptive softmax
    self.loss = tf.reduce_mean(tf.nn.nce_loss(softmax_w, softmax_b, h, self.y, conf.num_sampled, conf.vocab_size))

    # Momentum optimizer with element-wise gradient clipping.
    trainer = tf.train.MomentumOptimizer(conf.learning_rate, conf.momentum)
    gradients = trainer.compute_gradients(self.loss)
    clipped_gradients = [(tf.clip_by_value(_[0], -conf.grad_clip, conf.grad_clip), _[1]) for _ in gradients]
    self.optimizer = trainer.apply_gradients(clipped_gradients)
    self.perplexity = tf.exp(self.loss)

    self.create_summaries()

What is this zero padding you're talking about?

Are you talking about mask_layer?

    mask_layer[:,0:conf.filter_h/2,:] = 0
    embed *= mask_layer

Unless I am mistaken, this only zeroes out the first few positions, which you actually want to keep because that's the history.

Basically, a model that does y = x[1:]+[0] would do quite well... I would guess, then, that the gating layer is just allowing for a cleaner shift.
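For illustration (a hypothetical snippet using the first rows of the batches above, not code from the repo):

    # A "model" that just shifts the input left by one and pads with a 0
    # matches the target at every position except the last one.
    x      = [1, 1, 3, 13, 123, 5, 12, 152, 7]
    y_true = [1, 3, 13, 123, 5, 12, 152, 7, 84]      # next-word targets
    y_copy = x[1:] + [0]                             # the shift-and-pad baseline
    matches = sum(a == b for a, b in zip(y_copy, y_true))
    print(matches, "of", len(y_true))                # 8 of 9, without learning anything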

I feel like there is something I am missing. Maybe you can clarify?

This implementation seems to be bogus. The original implementation is for Torch, and the paper doesn't describe how data preparation is done.

thangduong avatar Sep 11 '17 21:09 thangduong

@thangduong I agree with you. I found that the mask and padding are only applied to the embedding layer, not to the subsequent conv layers. I suspect this lets future information be peeked at in the middle conv layers. What do you think about it now?

sonack avatar Sep 23 '18 05:09 sonack

@ruotianluo Zero padding is used in every layer to keep the layer size same: https://github.com/anantzoid/Language-Modeling-GatedCNN/blob/master/model.py#L62 The zero padding I referred to in the above comment is the extra padding required to prevent the filter from seeing the future words.

Do you mean that you used SAME padding there, which adds zero padding to produce an output of the same size and is also supposed to prevent the next conv layer from seeing future information? If so, I don't think that is correct. SAME padding in TensorFlow pads the left and right sides as evenly as possible and, when that is not possible, puts the extra zero on the right. But if you want to prevent seeing the future, the padding should all be on the left side.
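A small worked example of that claim (my own sketch, assuming stride 1): for a filter of height k, SAME padding adds (k-1)//2 zeros above and the remainder below, so the window at position t reaches about k//2 steps into the future, while causal padding puts all k-1 zeros above.

    k = 5                                 # filter height, stride 1
    pad_total = k - 1
    same_top = pad_total // 2             # 2 zeros above
    same_bottom = pad_total - same_top    # 2 zeros below -> window at t covers t-2 .. t+2 (future!)

    causal_top, causal_bottom = k - 1, 0  # 4 zeros above, none below -> window covers t-4 .. t
    print(same_top, same_bottom, causal_top, causal_bottom)   # 2 2 4 0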

sonack avatar Sep 23 '18 05:09 sonack

@sonack I agree with you. We need to pad filter_size-1 zeros on the left in each layer.
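A minimal sketch of that fix (TF 1.x style, hypothetical helper name, not a patch to this repo): pad one less than the filter's time-axis height in zeros on the left of every layer and convolve with VALID padding.

    import tensorflow as tf   # TF 1.x, as used in this repo

    def causal_gated_conv(h, k, out_channels, name):
        """One gated conv layer that cannot see future time steps."""
        in_channels = h.get_shape().as_list()[-1]
        with tf.variable_scope(name):
            # k-1 zeros on the left of the time axis, none on the right.
            h_pad = tf.pad(h, [[0, 0], [k - 1, 0], [0, 0], [0, 0]])
            w = tf.get_variable("w", [k, 1, in_channels, out_channels])
            v = tf.get_variable("v", [k, 1, in_channels, out_channels])
            a = tf.nn.conv2d(h_pad, w, [1, 1, 1, 1], "VALID")
            b = tf.nn.conv2d(h_pad, v, [1, 1, 1, 1], "VALID")
            return a * tf.sigmoid(b)   # gated linear unit

Stacking such layers keeps every hidden state a function of the current and past tokens only, which matches the one-step shift between X and Y.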

qixiang109 avatar Sep 25 '18 04:09 qixiang109

@qixiang109 Are you working on this gated CNN? Have you successfully reproduced the paper's result? I hope we can communicate with each other :)

sonack avatar Sep 25 '18 06:09 sonack

@sonack Sorry, I haven't checked here in a long time. I haven't actually run gated CNN language-model experiments myself yet, but I am quite sure about the zero-padding scheme.

qixiang109 avatar Nov 09 '18 13:11 qixiang109