Greg Yang comments

Results: 16 comments by Greg Yang

Is this compatible with DeepSpeed / ZeRO?

@zhuzilin So `MuReadout` essentially just does the following

```python
def forward(self, x):
    return super().forward(
        self.output_mult * x / self.weight.infshape.width_mult())
```

If for some reason you can't use `MuReadout` as is,...
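To make the scaling in that `forward` concrete, here is a minimal sketch of the same idea in plain PyTorch, without the `mup` package. The class name `ScaledReadout` and its `width_mult` constructor argument are hypothetical and not part of `mup`; the sketch assumes the width multiplier is simply the ratio of the current width to a base width and passes it in by hand instead of reading it from `weight.infshape`.

```python
import torch
import torch.nn as nn


class ScaledReadout(nn.Linear):
    """Hypothetical stand-in for a muP-style readout layer (not mup's MuReadout)."""

    def __init__(self, in_features, out_features,
                 width_mult=1.0, output_mult=1.0, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        # Assumption: width_mult = current width / base width, supplied by the caller.
        self.width_mult = float(width_mult)
        self.output_mult = float(output_mult)

    def forward(self, x):
        # Rescale the input, then apply the ordinary linear layer.
        return super().forward(self.output_mult * x / self.width_mult)


torch.manual_seed(0)
base_width, width = 64, 256  # illustrative sizes only
readout = ScaledReadout(width, 10, width_mult=width / base_width)
x = torch.randn(4, width)
y = readout(x)
```

Here the input is divided by 4 before the linear map, so the layer behaves like an ordinary `nn.Linear` applied to `x / 4`.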

Is this compatible with DeepSpeed / ZeRO?

@zhuzilin @StellaAthena How is the DeepSpeed integration going? We can connect you with members of the DeepSpeed team if necessary.

integration with Flax?

Integration with Flax would be fantastic, but neither I nor @edwardjhu are familiar with it. If someone from the Flax team can work with us, we can definitely advise the...

integration with Flax?

Hey @davisyoshida your repo looks great so far! For your plot, you'd get better results if you tune the input, output, and hidden learning rates for your small model and...

Batchnorm

@googlebot I signed it!

Does mup work with model with Conv2D as output?

Closing this issue for now, but feel free to re-open when there are new updates.

Are Sequentials with list comprehension handled incorrectly?

Hi Robert, Not sure I understand your problem exactly since I don’t see the rest of your code, but are you creating the base shapes for one depth value L1...

Batch size, Seq len, Step Transfering

Adding on to Edward, the usual lr/batch_size dependency rule is when you fix the number of epochs, whereas here we are fixing the number of steps (because we are *shrinking*...

Implement Maximal Update Parametrization (muP)

Hi @sgugger, is there any particular reason you say that `mup` is very targeted toward Transformers? We definitely designed `mup` with general models in mind, even though Transformers would be...

Implement Maximal Update Parametrization (muP)

After discussion with Edward, we think perhaps hosting custom model code on the Hub would be the best way to go. We have some questions about this: 1. Is there...