Weight decay for BERT models
Hi! I noticed that in your code for the BERT AdamW optimizer, you only apply weight decay to parameters whose names contain the strings bias or LayerNorm.weight:
https://github.com/facebookresearch/BalancingGroups/blob/72d31e56e168b8ab03348810d4c5bac0f8a90a7a/models.py#L41-L45
The original group DRO code seems to do the opposite, applying weight decay to all parameters except those:
https://github.com/kohpangwei/group_DRO/blob/master/train.py#L111-L114
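For comparison, here is a minimal sketch of the grouping I would expect, following the group_DRO snippet linked above (the model name, learning rate, and weight decay value are placeholders, not values taken from either repo). Swapping the `any` / `not any` conditions gives the inverted behaviour I'm describing:

```python
import torch
from transformers import BertForSequenceClassification

# Placeholder model and hyperparameters, for illustration only.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
weight_decay = 0.01

# group_DRO / usual Hugging Face convention: biases and LayerNorm weights are
# excluded from weight decay, every other parameter gets it.
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters()
                   if not any(nd in n for nd in no_decay)],
        "weight_decay": weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters()
                   if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=2e-5)
```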