Jianbin Chang
Hi Edward, thank you very much; your advice saved me. A larger learning rate exposed the problem (the plot showed jitter), so I debugged and fixed it, and now the...
Hi @edwardjhu, I've recently run some experiments extending the previous discussion. I found that transferring the same hyperparameters from a 350M model to the 1.3B scale works fine, but...
Another question: the [transformer example](https://github.com/microsoft/mup/blob/main/examples/Transformer/model.py#L174) and [mutransformers](https://github.com/microsoft/mutransformers/blob/ed0e4af9700247e2067a131c2757a85133ab7d09/mutransformers/models/gpt2/modeling_gpt2.py#L475) use different initialization methods, `(init_std / d_model) ** 0.5` vs. `init_std * width_mult ** -0.5`. Are these two formulas equivalent in...
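To make the comparison concrete, here is a minimal sketch (not code from either repo) that evaluates both expressions side by side. It assumes `width_mult = d_model / d_base`, with `d_base` being the base model's width; that definition and the specific numbers below are assumptions for illustration only.

```python
# Compare the two init-std formulas mentioned above.
# Assumption: width_mult = d_model / d_base, where d_base is the base width.

def init_std_transformer_example(init_std: float, d_model: int) -> float:
    # `(init_std / d_model) ** 0.5`, the form used in the mup Transformer example.
    return (init_std / d_model) ** 0.5

def init_std_mutransformers(init_std: float, d_model: int, d_base: int) -> float:
    # `init_std * width_mult ** -0.5`, the form used in mutransformers' GPT-2.
    width_mult = d_model / d_base
    return init_std * width_mult ** -0.5

if __name__ == "__main__":
    init_std, d_base = 0.02, 256  # hypothetical values, not taken from the repos
    for d_model in (256, 1024, 4096):
        a = init_std_transformer_example(init_std, d_model)
        b = init_std_mutransformers(init_std, d_model, d_base)
        # Both shrink as d_model ** -0.5, so they prescribe the same width scaling;
        # they differ only by a width-independent constant factor
        # (they coincide exactly only when init_std == 1 / d_base).
        print(f"d_model={d_model}: {a:.5f} vs {b:.5f}, ratio={a / b:.3f}")
```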
> I also tried this with a transformer-based model and found similar results where the transferred HPs did not result in better performance.

Hi @zanussbaum, I think the...
> Thanks! You are right that the advantage of muP over SP should become more apparent as the difference in width grows. As a direct consequence, the effect of random...
@FrankLeeeee @kurisusnowdeng Thanks for your response, it solved my confusion perfectly! BTW, I found the GPT example has excellent scaling efficiency but poor computing performance. Under the same hyperparameter...
> We also provide an average result after each epoch. Is that number also abnormal?

@kurisusnowdeng I haven't run a full epoch, but I have a task that will run...
@kurisusnowdeng The average result per epoch is 32.005, which is closer to the throughput than to the iteration time.