Greg Yang

Results: 12 comments by Greg Yang

@zhuzilin So `MuReadout` essentially just does the following:

```python
def forward(self, x):
    return super().forward(
        self.output_mult * x / self.weight.infshape.width_mult())
```

If for some reason you can't use `MuReadout` as is,...
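To make the arithmetic in that `forward` concrete, here is a minimal framework-free sketch. It is not `mup` code: `linear`, `scaled_readout`, and their parameters are hypothetical names for illustration, and `width_mult` is assumed to be the ratio of the current width to the base width. The point it shows is that the scaling is applied to the input before the linear layer, so the bias is left unscaled.

```python
# Illustrative sketch only; none of these names come from the mup library.

def linear(x, weight, bias):
    """Plain linear layer: y[j] = sum_i weight[j][i] * x[i] + bias[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def scaled_readout(x, weight, bias, output_mult=1.0, width_mult=1.0):
    """Scale the input by output_mult / width_mult, then apply linear().

    width_mult is assumed (not confirmed by the comment above) to be
    current_width / base_width for the layer's input dimension.
    """
    scale = output_mult / width_mult
    return linear([scale * xi for xi in x], weight, bias)


if __name__ == "__main__":
    x = [1.0, 2.0]
    weight = [[1.0, 1.0]]
    bias = [0.5]
    # With width_mult=1.0 this reduces to the plain linear layer.
    print(scaled_readout(x, weight, bias))
    # Doubling the width multiplier halves the input, but not the bias.
    print(scaled_readout(x, weight, bias, width_mult=2.0))
```

In a real PyTorch model the same idea would live in the `forward` of an `nn.Linear` subclass, as in the snippet above.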

@zhuzilin @StellaAthena How is the DeepSpeed integration going? We can connect you with members of the DeepSpeed team if necessary.

Integration with Flax would be fantastic, but neither I nor @edwardjhu are familiar with it. If someone from the Flax team can work with us, we can definitely advise the...

Hey @davisyoshida your repo looks great so far! For your plot, you'd get better results if you tune the input, output, and hidden learning rates for your small model and...

@googlebot I signed it!

Closing this issue for now, but feel free to re-open when there are new updates.

Hi Robert, I'm not sure I understand your problem exactly since I don't see the rest of your code, but are you creating the base shapes for one depth value L1...

Adding on to Edward, the usual lr/batch_size dependency rule is when you fix the number of epochs, whereas here we are fixing the number of steps (because we are *shrinking*...

Hi @sgugger, is there any particular reason you say that `mup` is very targeted toward Transformers? We definitely designed `mup` with general models in mind, even though Transformers would be...

After discussion with Edward, we think perhaps hosting custom model code on the Hub would be the best way to go. We have some questions about this: 1. Is there...