datablations

Is LR=1e-3 for muP the optimal value from the small-scale proxy model, and is dropout crucial for multi-epoch training?

Open · SeunghyunSEO opened this issue 1 year ago · 4 comments

Hi authors, thanks for the great work! I'm wondering whether LR=1e-3 for muP is the optimal value found from the small-scale proxy sweep, and how critical dropout is for multi-epoch training. For the latter, I guess you set dropout to 0.1 for regularization, but there is no dropout ablation study. Since it is common to set dropout to 0.0 in modern LLMs, it would be interesting to know when dropout becomes important.
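
For concreteness, here is a rough sketch of the kind of small-scale ablation I have in mind: sweep the learning rate on a tiny proxy model and toggle dropout between 0.0 and 0.1 while training for several epochs on the same data. The toy model, data, and hyperparameter grid below are my own illustrative assumptions, not your actual training stack.

```python
# Minimal sketch: LR sweep x dropout ablation on a toy proxy model under
# repeated-data (multi-epoch) training. Everything here is illustrative.
import torch
import torch.nn as nn


def make_proxy_model(d_model=64, dropout=0.1, vocab=512, seq_len=8):
    # Tiny MLP language-model stand-in; dropout sits after the hidden
    # activation, roughly where it would in a transformer block.
    return nn.Sequential(
        nn.Embedding(vocab, d_model),
        nn.Flatten(start_dim=1),                  # (batch, seq_len * d_model)
        nn.Linear(seq_len * d_model, 4 * d_model),
        nn.GELU(),
        nn.Dropout(dropout),
        nn.Linear(4 * d_model, vocab),
    )


def train(lr, dropout, epochs=4, steps_per_epoch=50, seq_len=8, vocab=512):
    torch.manual_seed(0)
    # Fixed small dataset, so epochs > 1 means genuinely repeated data.
    data = torch.randint(0, vocab, (steps_per_epoch, 32, seq_len))
    targets = torch.randint(0, vocab, (steps_per_epoch, 32))
    model = make_proxy_model(dropout=dropout, vocab=vocab, seq_len=seq_len)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in zip(data, targets):
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    # Evaluate with dropout disabled.
    model.eval()
    with torch.no_grad():
        return loss_fn(model(data[0]), targets[0]).item()


if __name__ == "__main__":
    for lr in (1e-4, 3e-4, 1e-3, 3e-3):
        for dropout in (0.0, 0.1):
            print(f"lr={lr:g} dropout={dropout}: final loss {train(lr, dropout):.3f}")
```

Of course this toy setup does not use muP itself; the question is whether the LR you transfer from the proxy sweep stays optimal at the target scale, and whether the dropout=0.1 choice actually matters once data is repeated.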

SeunghyunSEO · Jul 22 '24 08:07