Andrej

373 comments by Andrej

Yes this is the typical training regime. There is a special END OF TEXT token separating them, so the model is expected to learn that this token separates unrelated documents.
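To make this concrete, here is a minimal sketch of that data-prep step, assuming a GPT-2-style setup; the `toy_tokenize` helper and the document list are hypothetical stand-ins, not nanoGPT's actual code:

```python
# Sketch: unrelated documents are concatenated into one long token stream,
# separated by a special end-of-text token the model learns to treat as a
# document boundary. EOT_ID below uses GPT-2's <|endoftext|> id as an example.

EOT_ID = 50256

def toy_tokenize(text):
    # Stand-in tokenizer: one "token" per whitespace-separated word,
    # hashed into a small id space purely for illustration.
    return [hash(w) % 50000 for w in text.split()]

def build_stream(documents):
    stream = []
    for doc in documents:
        stream.extend(toy_tokenize(doc))
        stream.append(EOT_ID)  # delimiter between unrelated documents
    return stream

docs = ["first document here", "a totally unrelated second document"]
stream = build_stream(docs)
```

Training then samples fixed-length windows from `stream`, so a window can straddle a document boundary; the EOT token is what tells the model the context before it is unrelated.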

Can you explain more? Why does this improve compute efficiency?

So, your concerns are valid, but not exactly right. I spent much less time on nanoGPT from the inference standpoint. Calculating and passing in an attention mask is one way to...
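For illustration, a minimal sketch of what "calculating and passing in an attention mask" can look like; this is not nanoGPT's code, and the left-padding assumption and function name are hypothetical:

```python
# Sketch: an additive attention mask that combines the causal constraint
# (no attending to future positions) with a padding constraint (no attending
# to pad tokens), for a batched inference request of total length seq_len
# where only the first valid_len positions hold real tokens.

NEG_INF = float("-inf")

def attention_mask(seq_len, valid_len):
    # mask[i][j] == 0.0 where position i may attend to position j;
    # NEG_INF elsewhere. Added to attention logits before the softmax,
    # so masked positions get zero attention weight.
    mask = [[NEG_INF] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(i + 1):      # causal: only positions <= i
            if j < valid_len:       # padding: only real tokens
                mask[i][j] = 0.0
    return mask

m = attention_mask(seq_len=4, valid_len=3)
```

In a real model this would be a tensor added to the attention scores; the nested-list version above just makes the two constraints explicit.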

Yes, any additional loss makes a big difference, the farther you get into training.

The first few units of loss are just the most boring things, like learning that sentences end with "." and that spaces are important. All the interesting stuff gets learned...

At the scale of nanoGPT basically the answer is no. ICL (in context learning) emerges a few B parameters down the road.

This commit does two things: the thing you mentioned but also it introduces new variables for train/val paths...

Yeah, apparently it isn't all of Shakespeare. Silly, but I wasn't aware of it, or more likely I forgot by now :D. Would love the full works of Shakespeare...

That's nice, but I'd prefer we keep `n_layer_update` separate

I don't know, I don't really like these platforms much, and they usually irritate me with dark patterns when I stop by. I don't want to sign up for...