AdaptSum
question about learning rate
In "Attention Is All You Need", the learning rate under the Noam decay schedule is formulated as:

lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5))
But in your code, I found an extra original_lr factor, which is set to 0.05:
self._set_rate(
    self.original_lr *
    (self.model_size ** (-0.5) *
     min(self._step ** (-0.5),
         self._step * self.warmup_steps ** (-1.5))))
Why do we need to add this term?
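For concreteness, my reading is that original_lr simply rescales the whole Noam curve by a constant without changing its shape (warmup then decay). A self-contained sketch, with hyper-parameters assumed rather than taken from this repo:

# Same Noam shape, uniformly scaled by original_lr.
def scaled_noam(step_num, original_lr=0.05, model_size=512, warmup_steps=4000):
    return original_lr * model_size ** -0.5 * min(step_num ** -0.5,
                                                  step_num * warmup_steps ** -1.5)

# Every step's rate is exactly 0.05x the paper's formula.
for step in (100, 4000, 16000):
    print(step, scaled_noam(step))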