Increasing coord check for the network output
I'm implementing muP for the OLMo model, and am facing an issue with the coordinate check.
The l1 norm that keeps increasing is the network output's. Following the docs, I set both the readout init and the query init to zero. I also made sure that the initialization is applied after set_base_shapes is called.
What other things can I check to debug the issue?
Hi @AkshitaB, I'm reproducing muP too these days. Can you share the architecture? Or have you already solved the problem?
@AkshitaB (very delayed reply but still might be helpful)
In my experience, query/readout zero-init didn't help either. However, what I saw is that, while the readout norms grow during the early iterations, they do stabilise across widths after a sufficient number of iterations (around 30). You might already see hints of this in your plot for t=4, so running the coordinate check for more steps may flatten your readout norms.
But even if it doesn't, this has never been a problem for me in practice when doing muTransfer. What matters most is that the other layers' norms look flat, which is the case for you :)
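In case it helps, here is a rough sketch of how one could put a number on "flattening" instead of eyeballing the plot: fit the slope of log(l1) against log(width) separately for each step. A slope near 0 means the norm is flat across widths at that step; a clearly positive slope means it still grows with width. The function names and the (width, step, l1) record layout are my own assumptions for illustration, not part of the mup package, and the numbers at the bottom are synthetic, not real measurements.

```python
import math
from collections import defaultdict


def loglog_slope(widths, values):
    """Least-squares slope of log(value) against log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def slopes_per_step(records):
    """Map each step to the log-log slope of l1 versus width.

    `records` is an iterable of (width, step, l1) tuples for one module
    (hypothetical layout). Steps with fewer than two distinct widths, or
    with a non-positive l1, are skipped: the log is undefined there, which
    happens e.g. at the first forward pass when the readout is zero-initialized.
    """
    by_step = defaultdict(dict)
    for width, step, l1 in records:
        by_step[step][width] = l1
    result = {}
    for step in sorted(by_step):
        pairs = sorted(by_step[step].items())
        if len(pairs) < 2 or any(v <= 0 for _, v in pairs):
            continue
        widths = [w for w, _ in pairs]
        values = [v for _, v in pairs]
        result[step] = loglog_slope(widths, values)
    return result


# Synthetic numbers for illustration only (not taken from any real run).
records = [
    (256, 1, 0.10), (512, 1, 0.20), (1024, 1, 0.40),     # grows linearly with width
    (256, 30, 0.50), (512, 30, 0.50), (1024, 30, 0.50),  # flat across widths
]
slopes = slopes_per_step(records)
for step, slope in slopes.items():
    print(f"step {step}: log-log slope {slope:.2f}")
```

If the slope for the output layer shrinks toward 0 as the step count goes up, that would match the stabilisation described above; if it stays clearly positive even after many steps, something else is probably going on.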