Jianbin Chang
Hi Edward, thank you very much; your advice saved me. A larger learning rate exposed the problem (the plot showed jitter), so I debugged and fixed it, and now the...
Hi @edwardjhu, I've recently run some experiments extending the previous discussion. I found that transferring the same hyperparameters from a 350M model to the 1.3B scale works fine, but...
Another question: the [transformer example](https://github.com/microsoft/mup/blob/main/examples/Transformer/model.py#L174) and [mutransformers](https://github.com/microsoft/mutransformers/blob/ed0e4af9700247e2067a131c2757a85133ab7d09/mutransformers/models/gpt2/modeling_gpt2.py#L475) use different initialization methods, `(init_std / d_model) ** 0.5` vs. `init_std * width_mult ** -0.5`. Are these two formulas equivalent in...
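To make the comparison concrete, here is a minimal sketch (not code from either repo) that evaluates both expressions side by side. It assumes `width_mult = d_model / d_base`, with `d_base` being the base model's width; that definition and the specific numbers below are assumptions for illustration only.

```python
# Compare the two init-std formulas mentioned above.
# Assumption: width_mult = d_model / d_base, where d_base is the base width.

def init_std_transformer_example(init_std: float, d_model: int) -> float:
    # `(init_std / d_model) ** 0.5`, the form used in the mup Transformer example.
    return (init_std / d_model) ** 0.5

def init_std_mutransformers(init_std: float, d_model: int, d_base: int) -> float:
    # `init_std * width_mult ** -0.5`, the form used in mutransformers' GPT-2.
    width_mult = d_model / d_base
    return init_std * width_mult ** -0.5

if __name__ == "__main__":
    init_std, d_base = 0.02, 256  # hypothetical values, not taken from the repos
    for d_model in (256, 1024, 4096):
        a = init_std_transformer_example(init_std, d_model)
        b = init_std_mutransformers(init_std, d_model, d_base)
        # Both shrink as d_model ** -0.5, so they prescribe the same width scaling;
        # they differ only by a width-independent constant factor
        # (they coincide exactly only when init_std == 1 / d_base).
        print(f"d_model={d_model}: {a:.5f} vs {b:.5f}, ratio={a / b:.3f}")
```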
> I also tried this with a transformer-based model and found similar results where the transferred HPs did not result in better performance.

Hi @zanussbaum, I think the...
> Thanks! You are right that the advantage of muP over SP should become more apparent as the difference in width grows. As a direct consequence, the effect of random...
@FrankLeeeee @kurisusnowdeng Thanks for your response, it solved my confusion perfectly! BTW, I found the GPT example has excellent scaling efficiency but poor computing performance. Under the same hyperparameter...
> We also provide an average result after each epoch. Is that number also abnormal?

@kurisusnowdeng I haven't run a full epoch, but I have a task that will run...
@kurisusnowdeng The average result per epoch is 32.005, which is closer to the throughput than to the iteration time.