datablations
Wondering if LR=1e-3 for mup is the optimal value from the small-scale proxy model, and whether dropout is crucial for multi-epoch training
Hi authors, thanks for the great work! I'm wondering whether LR=1e-3 for mup is the optimal value found on the small-scale proxy model, and how critical dropout is for multi-epoch training. For the latter, I guess you set dropout to 0.1 for regularization, but there is no dropout ablation study. Since it's common to set dropout to 0.0 in modern LLMs, it would be interesting to know when dropout becomes important.
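For context on the first question, here is a minimal sketch of what "optimal LR from a small-scale proxy model" means under µP: you sweep learning rates on a narrow model and, if the model is properly µP-parametrized, the best LR transfers to the wide target. This uses Microsoft's `mup` package (https://github.com/microsoft/mup); the `ProxyMLP` model, widths, and sweep values are hypothetical illustrations, not the authors' actual setup. The `dropout` argument also marks where the 0.1-vs-0.0 choice from the second question would enter an ablation.

```python
import torch
import torch.nn as nn
from mup import MuReadout, MuAdam, set_base_shapes

class ProxyMLP(nn.Module):
    """Toy proxy model; `width` is the dimension that µP scales."""
    def __init__(self, width, dropout=0.1, d_in=32, n_classes=10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d_in, width),
            nn.ReLU(),
            nn.Dropout(dropout),  # the 0.1 vs 0.0 choice asked about above
        )
        self.head = MuReadout(width, n_classes)  # µP-aware output layer

    def forward(self, x):
        return self.head(self.body(x))

def make_mup_model(width, dropout=0.1):
    model = ProxyMLP(width, dropout)
    # Base and delta models tell mup how each shape scales with width.
    set_base_shapes(model, ProxyMLP(64, dropout), delta=ProxyMLP(128, dropout))
    return model

# Sweep LRs on a narrow proxy; under µP the winning LR (e.g. 1e-3)
# should transfer to the wide target model without re-tuning.
x = torch.randn(256, 32)
y = torch.randint(0, 10, (256,))
for lr in (1e-4, 3e-4, 1e-3, 3e-3):
    model = make_mup_model(width=256)
    opt = MuAdam(model.parameters(), lr=lr)  # µP-corrected Adam
    for _ in range(20):  # a few steps are enough to rank LRs here
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    print(f"lr={lr:g}  final loss={loss.item():.4f}")
```

So the question is essentially whether 1e-3 came out of a sweep like this on the proxy, and whether the dropout flag above was ever ablated between 0.0 and 0.1 in the multi-epoch regime.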