Adafactor: Update Clipping vs Gradient Clipping
The Adafactor paper (section 6) suggests "update clipping" instead of the usual "gradient clipping".
Here, "update clipping" means clipping the update after the gradient has been rescaled by the second-moment estimate (line 9 above).
However, the alias defined in alias.py clips the initial gradients:
https://github.com/deepmind/optax/blob/master/optax/_src/alias.py#L132-L140
Is this deliberate? FWIW, t5x's adafactor implements update clipping: https://github.com/google-research/t5x/blob/03dfc44be7f9a93d34c1d7fd6f896d1c364a7d4d/t5x/adafactor.py#L470-L476
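To make the distinction in the question concrete, here is a minimal NumPy sketch of the two orderings. It is an illustration only, not the optax or t5x code: it uses a plain (unfactored) second-moment average, and the helper names (`rms`, `clip_by_rms`, `second_moment`, `step_update_clipping`, `step_gradient_clipping`) and default values are assumptions chosen for the example.

```python
import numpy as np


def rms(x):
    """Root mean square over all entries of x."""
    return np.sqrt(np.mean(np.square(x)))


def clip_by_rms(x, threshold):
    """Scale x down so its RMS is at most `threshold`; otherwise leave it as is."""
    return x / np.maximum(1.0, rms(x) / threshold)


def second_moment(grad, v, decay, eps):
    # Simplified, unfactored moving average of squared gradients (assumption:
    # the factored row/column statistics of Adafactor are omitted here).
    return decay * v + (1.0 - decay) * (np.square(grad) + eps)


def step_update_clipping(grad, v, decay=0.999, eps=1e-30, threshold=1.0):
    # Ordering suggested by the paper: rescale first, then clip the update.
    v = second_moment(grad, v, decay, eps)
    update = grad / np.sqrt(v)
    return clip_by_rms(update, threshold), v


def step_gradient_clipping(grad, v, decay=0.999, eps=1e-30, threshold=1.0):
    # Ordering described in the question: clip the raw gradient, then rescale.
    grad = clip_by_rms(grad, threshold)
    v = second_moment(grad, v, decay, eps)
    return grad / np.sqrt(v), v


grad = np.array([3.0, -4.0, 12.0])
v0 = np.zeros_like(grad)  # freshly initialised second moment

u_update, _ = step_update_clipping(grad, v0)
u_grad, _ = step_gradient_clipping(grad, v0)
print("RMS with update clipping:  ", rms(u_update))
print("RMS with gradient clipping:", rms(u_grad))
```

In this toy setting, only the first ordering bounds the RMS of the final update by the threshold. Clipping the raw gradient does not, because the division by the square root of the second moment rescales the clipped gradient afterwards.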
Thanks for the question!
@mtthss could you comment on this, since you're the one who ported Adafactor from Flax?