Bug with learning rate for Adagrad Optimizer?
Hi, I noticed that in line 140 of glove.c we do fdiff *= eta, and fdiff is then used in lines 145 and 146 to calculate temp1 and temp2:
/* Adaptive gradient updates */
fdiff *= eta; // for ease in calculating gradient
real W_updates1_sum = 0;
real W_updates2_sum = 0;
for (b = 0; b < vector_size; b++) {
    // learning rate times gradient for word vectors
    temp1 = fdiff * W[b + l2];
    temp2 = fdiff * W[b + l1];
    // adaptive updates
    W_updates1[b] = temp1 / sqrt(gradsq[b + l1]);
    W_updates2[b] = temp2 / sqrt(gradsq[b + l2]);
    W_updates1_sum += W_updates1[b];
    W_updates2_sum += W_updates2[b];
    gradsq[b + l1] += temp1 * temp1;
    gradsq[b + l2] += temp2 * temp2;
}
However, since we also use temp1 and temp2 to accumulate gradsq[b + l1] and gradsq[b + l2], eta is squared along with the gradients, which means the following (writing g_i for the raw gradient at step i):

W_updates1[b] = eta * g_t / sqrt(sum_i (eta * g_i)^2) = g_t / sqrt(sum_i g_i^2),

such that the learning rate cancels out. What would then happen is that, because gradsq is initialized to 1.0, the first update of each coordinate is taken with learning rate eta; once the accumulated squared gradients dominate that initial 1.0, the effective learning rate is 1.0 for all coordinates.
The same applies to the training of the biases.
Is this intentional?
Thanks, Hugo
It seems to me that by including the learning rate in the accumulated squared gradient, the squared sum grows more slowly, since a smaller number (eta times the gradient, with eta < 1) is being squared. The updates therefore do not shrink in size as fast as they would if the full squared gradient were accumulated.
Not exactly to the AdaGrad specification?
If I am wrong here, please correct me! :-)