Dirk Groeneveld

200 comments by Dirk Groeneveld

What's going on with this PR? Can we merge?

Can you link me to the wandb? This might be a graphing issue with wandb. Wandb will use different sampling depending on how long the run is. Shorter runs appear...

Section 3.2 does not graph activations, it graphs gradients, and it shows the gradient norm across training steps, not through the layers. We have not checked whether activations grow as...

Why does the mean of the activations start below zero?

I thought you'd be plotting the mean of the _absolute_ values. Otherwise it doesn't really show the effect you're after, which is (I think) that the magnitude of the activations...
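To illustrate the difference between the two statistics, here is a minimal sketch using made-up data, not anything from the run under discussion. It assumes zero-centred Gaussian "activations" and a hypothetical helper `signed_and_abs_mean`. The signed mean stays near zero, and can land slightly below it, however large the values get, while the mean of the absolute values tracks the magnitude.

```python
import random
import statistics


def signed_and_abs_mean(values):
    """Return (mean of the values, mean of their absolute values)."""
    signed = statistics.fmean(values)
    magnitude = statistics.fmean([abs(v) for v in values])
    return signed, magnitude


# Hypothetical activations: zero-centred Gaussian samples, and the same
# samples scaled up 10x to mimic activations that have grown in magnitude.
rng = random.Random(0)
small = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
large = [10.0 * v for v in small]

small_signed, small_abs = signed_and_abs_mean(small)
large_signed, large_abs = signed_and_abs_mean(large)

print(f"small: mean={small_signed:+.3f}  mean|x|={small_abs:.3f}")
print(f"large: mean={large_signed:+.3f}  mean|x|={large_abs:.3f}")
```

In this toy setup the signed mean barely moves relative to the scale of the data, so a plot of it would hide the growth; the mean of the absolute values scales by the same factor of 10 as the activations themselves.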

Let me know when this is ready for another review?

I think this is more a question of the interconnect you have. How are the A100s connected to each other? You can probably get a fair bit of extra performance...

Not sure you wanted to tag me? I have nothing to do with Thor. Maybe @chrisc36 knows who does?

Oh, I see. You put a reference in the description 🙈. Paper says you pushed this to 1B/100B tokens. Can you go further? Experience says, things like this stop working...