cnn-text-classification-tf
How to constrain L2-Norm of weights in the last layer as Kim did?
I'm new to both NLP and TensorFlow.
I found a way to constrain gradients with the following code:
global_step = tf.Variable(0, name="global_step", trainable=False)
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(cnn.loss)
for i, (g, v) in enumerate(grads_and_vars):
    if g is not None:
        grads_and_vars[i] = (tf.clip_by_norm(g, 3), v)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
I was wondering whether I could also modify the weights here, so I tried

grads_and_vars[i] = (tf.clip_by_norm(g, 3), tf.clip_by_norm(v, 3))

However, that didn't work :(
Can I get some help from you?
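For context, a sketch of why that attempt fails (my own reading, not code from this repo): apply_gradients expects each pair to be (gradient, tf.Variable), and tf.clip_by_norm(v, 3) returns a plain tensor rather than a variable, so the optimizer has nothing it can write the update into. A common workaround is to keep the gradient clipping in the (grad, var) pairs as above and enforce the weight max-norm with a separate assign op; the name clip_W_op below is hypothetical, and cnn.output_W follows the naming used later in this thread.

# Gradient clipping stays in the (grad, var) pairs, as in the snippet above.
grads_and_vars = optimizer.compute_gradients(cnn.loss)
clipped_gvs = [(tf.clip_by_norm(g, 3), v) for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped_gvs, global_step=global_step)

# The max-norm constraint on the weights is a separate op that rescales the
# variable in place; it has to be run explicitly (e.g. once per training step).
clip_W_op = tf.assign(cnn.output_W, tf.clip_by_norm(cnn.output_W, 3.0))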
Here is what I did.

# Final (unnormalized) scores and predictions
with tf.name_scope("output"):
    self.output_W = tf.get_variable(
        "W",
        shape=[num_filters_total, num_classes],
        initializer=tf.contrib.layers.xavier_initializer())
    b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
    l2_loss += tf.nn.l2_loss(self.output_W)
    l2_loss += tf.nn.l2_loss(b)
    self.scores = tf.nn.xw_plus_b(self.h_drop, self.output_W, b, name="scores")
    self.predictions = tf.argmax(self.scores, 1, name="predictions")
def train_step(x_batch, y_batch):
    feed_dict = {
        cnn.input_x: x_batch,
        cnn.input_y: y_batch,
        cnn.dropout_prob: FLAGS.dropout_prob
    }
    _, step, summaries, loss, accuracy = sess.run(
        [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
        feed_dict)
    # Project output_W back inside the L2-norm ball of radius 3 after the update.
    sess.run(cnn.output_W.assign(tf.clip_by_norm(cnn.output_W, 3.0)))
    time_str = datetime.datetime.now().isoformat()
    print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
    train_summary_writer.add_summary(summaries, step)
The accuracy reaches 72%, not the 76% you mention in your blog:

"I experimented with adding additional L2 penalties for the weights at the last layer and was able to bump up the accuracy to 76%, close to that reported in the original paper."

I'm not sure if I did it correctly :(
It seems that you clip W after all the training steps. Shouldn't the clipping be applied at each gradient-descent step? Note that this clipping operates on W to bound its norm, not on the gradients to prevent gradient explosion.
The code for clipping W is in text_cnn.py, so I think the clipping is performed at every step; at least that's what I intended. I might be wrong because I'm not familiar with TF 😢
As for the gradient-clipping operation, someone told me it's a good idea, and I don't think it hurts. Too eager to learn, right?
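One caveat with the train_step shown above: because the assign is constructed inside the function, it does run once per batch, but it also adds a new node to the graph on every call. A sketch (my own, not from the repo; train_and_clip_op is a hypothetical name) that builds the projection op once and ties it to the training op so it is guaranteed to run each step:

# Build once, outside the training loop.
with tf.control_dependencies([train_op]):
    # The projection runs only after the parameter update has been applied.
    train_and_clip_op = tf.assign(cnn.output_W, tf.clip_by_norm(cnn.output_W, 3.0))

# In train_step, fetch this single op instead of train_op plus a fresh assign:
# _, step, summaries, loss, accuracy = sess.run(
#     [train_and_clip_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
#     feed_dict)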
Can someone please shed some light on why the output of l2_loss(b) is added to the same variable as l2_loss(self.output_W)?

l2_loss += tf.nn.l2_loss(self.output_W)
l2_loss += tf.nn.l2_loss(b)
@hkhatod That is the L2 regularization term: the squared norms of W and b are accumulated into a single scalar so they can be added to the loss as one penalty. The general idea is to keep the structural risk minimal.
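For concreteness, a sketch of how such an accumulated penalty is typically folded into the training objective (variable names follow the snippet above; l2_reg_lambda is the regularization strength, and self.input_y is assumed to hold the one-hot labels as in this repo's model class, so check text_cnn.py for the exact form):

# Cross-entropy loss plus the accumulated L2 penalty, weighted by l2_reg_lambda.
losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss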
I am also looking into this issue. I think you can refer to the original Theano code: https://github.com/yoonkim/CNN_sentence/blob/master/conv_net_sentence.py#L227
It seems that all weights except the first layer are clipped. But I think clipping the classifier layer is not proper, since it may affect the final probabilities computed from the classifier scores.
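One more detail worth checking against that Theano code: the paper describes the constraint as ||w||_2 <= s for each weight vector (s = 3), i.e. applied per column of the weight matrix rather than to the whole matrix at once, which is what a bare clip_by_norm on output_W does. A hedged sketch of the column-wise version (clip_W_cols_op is a hypothetical name; verify the axis against the actual weight layout):

# Rescale each column of output_W independently so its L2 norm is at most 3.
clip_W_cols_op = tf.assign(
    cnn.output_W,
    tf.clip_by_norm(cnn.output_W, 3.0, axes=[0]))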
@csyanbin Agreed. Some papers claim that clipping is not elegant; it's more of an empirical trick.