Andrej
I haven't experimented with this much; my understanding is that papers leave this random and let the optimization figure it out? I could be wrong, I don't recall papers...
Oh, as to why we're using the absolute positional embedding scheme: I am just following GPT-2 for now. I want to include more elaborate schemes (relative positional embeddings, rotary, etc.), currently...
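For reference, a minimal sketch of what GPT-2-style learned absolute positional embeddings look like (the names `wte`, `wpe`, and the dimensions here are illustrative, not the actual repo code):

```python
import torch
import torch.nn as nn

# Illustrative sizes (GPT-2-like); not taken from the actual codebase.
block_size, n_embd, vocab_size = 1024, 768, 50257

wte = nn.Embedding(vocab_size, n_embd)  # token embeddings
wpe = nn.Embedding(block_size, n_embd)  # learned absolute positional embeddings

idx = torch.randint(0, vocab_size, (2, 16))  # (B, T) batch of token indices
pos = torch.arange(idx.size(1))              # (T,) positions 0..T-1
x = wte(idx) + wpe(pos)                      # broadcast add -> (B, T, n_embd)
```

The key property is that each position 0..block_size-1 gets its own trainable vector, with no notion of relative offsets; relative and rotary schemes replace `wpe` with position information injected inside attention instead.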
The eval_interval is 2000 by default, maybe then?
Each batch has 12 * 1024 = 12,288 tokens, because 1024 is the block size. All of those tokens get trained on in parallel.
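To make "trained on in parallel" concrete, here is a hedged sketch (the vocab size and random tensors are stand-ins): the loss is a cross-entropy averaged over every one of the 12 * 1024 positions at once, not computed token-by-token.

```python
import torch
import torch.nn.functional as F

B, T, V = 12, 1024, 50257                 # batch size, block size, vocab (illustrative)
logits = torch.randn(B, T, V)             # stand-in for model outputs, one per token
targets = torch.randint(0, V, (B, T))     # stand-in next-token targets

# Every (batch, position) pair contributes a loss term in one shot.
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
print(B * T)  # 12288 positions trained simultaneously
```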
(I'm still thinking about it! Was meaning to profile both versions and see if I could simplify but keep the speedups in any way. Re-opening, ty)
I don't know how I feel about tqdm yet...
lol nice try. we're still going to keep our jobs for a bit longer :)
Great!! Running some benchmarks and adjusting the code slightly...
(also I think this code is wrong because the line ``` y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T,...
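The shape comment is indeed off: multiplying the attention weights by the values contracts over the key dimension, so the result has the head-size dimension, not a second T. A small self-contained check (toy sizes, not the repo's actual config):

```python
import torch

B, nh, T, hs = 2, 4, 8, 16                             # toy dimensions
att = torch.softmax(torch.randn(B, nh, T, T), dim=-1)  # attention weights (B, nh, T, T)
v = torch.randn(B, nh, T, hs)                          # values (B, nh, T, hs)

# (B, nh, T, T) @ (B, nh, T, hs) -> (B, nh, T, hs)
y = att @ v
print(y.shape)  # torch.Size([2, 4, 8, 16])
```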