AdaptSum
question about learning rate
In "Attention Is All You Need", the learning rate under the Noam decay schedule is formulated as:

lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5))
But in your code, I found an extra original_lr factor, which is set to 0.05:
self._set_rate(
    self.original_lr *
    (self.model_size ** (-0.5) *
     min(self._step ** (-0.5),
         self._step * self.warmup_steps ** (-1.5))))
Why do we need to add this term?
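For concreteness, my reading is that original_lr simply rescales the whole Noam curve by a constant without changing its shape (warmup then decay). A self-contained sketch, with hyper-parameters assumed rather than taken from this repo:

# Same Noam shape, uniformly scaled by original_lr.
def scaled_noam(step_num, original_lr=0.05, model_size=512, warmup_steps=4000):
    return original_lr * model_size ** -0.5 * min(step_num ** -0.5,
                                                  step_num * warmup_steps ** -1.5)

# Every step's rate is exactly 0.05x the paper's formula.
for step in (100, 4000, 16000):
    print(step, scaled_noam(step))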