datablations
Wondering if LR=1e-3 for mup is the optimal value from the small-scale proxy model, and whether dropout is crucial for multi-epoch training
Hi authors, thanks for the great work! I'm wondering whether LR=1e-3 for mup is the optimal value found on the small-scale proxy model, and how critical dropout is for multi-epoch training. For the latter, I guess you set dropout to 0.1 for regularization, but there is no dropout ablation study. Since it's common to set dropout to 0.0 in modern LLMs, it would be interesting to know when dropout becomes important.
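For context on the first question, here is a minimal sketch of what "optimal LR from a small-scale proxy model" means under µP: you sweep learning rates on a narrow model and, if the model is properly µP-parametrized, the best LR transfers to the wide target. This uses Microsoft's `mup` package (https://github.com/microsoft/mup); the `ProxyMLP` model, widths, and sweep values are hypothetical illustrations, not the authors' actual setup. The `dropout` argument also marks where the 0.1-vs-0.0 choice from the second question would enter an ablation.

```python
import torch
import torch.nn as nn
from mup import MuReadout, MuAdam, set_base_shapes

class ProxyMLP(nn.Module):
    """Toy proxy model; `width` is the dimension that µP scales."""
    def __init__(self, width, dropout=0.1, d_in=32, n_classes=10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d_in, width),
            nn.ReLU(),
            nn.Dropout(dropout),  # the 0.1 vs 0.0 choice asked about above
        )
        self.head = MuReadout(width, n_classes)  # µP-aware output layer

    def forward(self, x):
        return self.head(self.body(x))

def make_mup_model(width, dropout=0.1):
    model = ProxyMLP(width, dropout)
    # Base and delta models tell mup how each shape scales with width.
    set_base_shapes(model, ProxyMLP(64, dropout), delta=ProxyMLP(128, dropout))
    return model

# Sweep LRs on a narrow proxy; under µP the winning LR (e.g. 1e-3)
# should transfer to the wide target model without re-tuning.
x = torch.randn(256, 32)
y = torch.randint(0, 10, (256,))
for lr in (1e-4, 3e-4, 1e-3, 3e-3):
    model = make_mup_model(width=256)
    opt = MuAdam(model.parameters(), lr=lr)  # µP-corrected Adam
    for _ in range(20):  # a few steps are enough to rank LRs here
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    print(f"lr={lr:g}  final loss={loss.item():.4f}")
```

So the question is essentially whether 1e-3 came out of a sweep like this on the proxy, and whether the dropout flag above was ever ablated between 0.0 and 0.1 in the multi-epoch regime.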