Andrej
I haven't experimented with this much; my understanding is that papers leave this random and let the optimization figure it out? I could be wrong, I don't recall papers...
Oh, as to why we're using the absolute positional embedding scheme: I am just following GPT-2 for now. I want to include more elaborate schemes (relative positional embeddings, rotary, etc.), currently...
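For reference, a minimal sketch of what GPT-2-style learned absolute positional embeddings look like (the names `wte`, `wpe`, and the dimensions here are illustrative, not the actual repo code):

```python
import torch
import torch.nn as nn

# Illustrative sizes (GPT-2-like); not taken from the actual codebase.
block_size, n_embd, vocab_size = 1024, 768, 50257

wte = nn.Embedding(vocab_size, n_embd)  # token embeddings
wpe = nn.Embedding(block_size, n_embd)  # learned absolute positional embeddings

idx = torch.randint(0, vocab_size, (2, 16))  # (B, T) batch of token indices
pos = torch.arange(idx.size(1))              # (T,) positions 0..T-1
x = wte(idx) + wpe(pos)                      # broadcast add -> (B, T, n_embd)
```

The key property is that each position 0..block_size-1 gets its own trainable vector, with no notion of relative offsets; relative and rotary schemes replace `wpe` with position information injected inside attention instead.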
The eval_interval is 2000 by default, maybe then?
Each batch has 12 * 1024 = 12,288 tokens, because 1024 is the block size. All of those tokens get trained on in parallel.
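To make "trained on in parallel" concrete, here is a hedged sketch (the vocab size and random tensors are stand-ins): the loss is a cross-entropy averaged over every one of the 12 * 1024 positions at once, not computed token-by-token.

```python
import torch
import torch.nn.functional as F

B, T, V = 12, 1024, 50257                 # batch size, block size, vocab (illustrative)
logits = torch.randn(B, T, V)             # stand-in for model outputs, one per token
targets = torch.randint(0, V, (B, T))     # stand-in next-token targets

# Every (batch, position) pair contributes a loss term in one shot.
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
print(B * T)  # 12288 positions trained simultaneously
```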
(I'm still thinking about it! Was meaning to profile both versions and see if I could simplify but keep the speedups in any way. Re-opening, ty)
I don't know how I feel about tqdm yet...
lol nice try. we're still going to keep our jobs for a bit longer :)
Great!! Running some benchmarks and adjusting the code slightly...
(also I think this code is wrong because the line ``` y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T,...
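The shape comment is indeed off: multiplying the attention weights by the values contracts over the key dimension, so the result has the head-size dimension, not a second T. A small self-contained check (toy sizes, not the repo's actual config):

```python
import torch

B, nh, T, hs = 2, 4, 8, 16                             # toy dimensions
att = torch.softmax(torch.randn(B, nh, T, T), dim=-1)  # attention weights (B, nh, T, T)
v = torch.randn(B, nh, T, hs)                          # values (B, nh, T, hs)

# (B, nh, T, T) @ (B, nh, T, hs) -> (B, nh, T, hs)
y = att @ v
print(y.shape)  # torch.Size([2, 4, 8, 16])
```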