Andrej
I'm trying to take this path but it's just making things worse and more complicated :( E.g. now I can't do `--compile=False` because of argparse's opinions about boolean variables. Which I...
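(For the record, the argparse pitfall is that `type=bool` just calls `bool()` on the string, and any non-empty string is truthy, so `--compile=False` parses as `True`. A minimal sketch of the usual workaround, not the actual nanoGPT argument handling:)

```python
import argparse

def str2bool(v):
    # argparse with type=bool would compute bool("False"), which is True
    # because "False" is a non-empty string; parse the string explicitly
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "1"):
        return True
    if v.lower() in ("no", "false", "f", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {v!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--compile", type=str2bool, default=True)

print(parser.parse_args(["--compile=False"]).compile)  # → False
print(bool("False"))  # the naive type=bool behavior → True
```

(On Python 3.9+ `argparse.BooleanOptionalAction` is another option, though it changes the flag syntax to `--compile` / `--no-compile`.)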
Thank you for the PR, yes this is weight tying, a common technique https://paperswithcode.com/method/weight-tying . It reduces the number of parameters, which is probably also very helpful for distributed training...
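(To put a number on the savings for GPT-2 124M: the token embedding is vocab_size × n_embd = 50257 × 768, and without tying the output head would duplicate a matrix of that size. A quick back-of-envelope check:)

```python
vocab_size, n_embd = 50257, 768  # GPT-2 124M config
# an untied lm_head would add a second (vocab_size, n_embd) matrix;
# tying it to wte saves exactly this many parameters
tied_savings = vocab_size * n_embd
print(f"{tied_savings:,}")  # 38,597,376 parameters saved
```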
@vgoklani what is the test exactly? forward pass? It also breaks the current nanoGPT code, with an error in `configure_optimizers`
So I tried to add support for tied weights here https://github.com/karpathy/nanoGPT/commit/7c8288552b3673574e0649e031963b8e7e8d4981 , TLDR it's just one line

```python
self.lm_head.weight = self.transformer.wte.weight # https://paperswithcode.com/method/weight-tying
```

and then some trickery with configure_optimizers,...
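(In isolation the tying, and the reason param-collection code like `configure_optimizers` needs care, looks roughly like this minimal sketch; `TinyLM` is a made-up module, not the nanoGPT `GPT` class:)

```python
import torch.nn as nn

class TinyLM(nn.Module):
    # hypothetical minimal model demonstrating weight tying
    def __init__(self, vocab_size=100, n_embd=32):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        # the one-line weight tying: both modules now share one tensor
        self.lm_head.weight = self.wte.weight

    def forward(self, idx):
        return self.lm_head(self.wte(idx))

model = TinyLM()
# nn.Module.parameters() deduplicates shared tensors by identity, so the
# tied weight is yielded once; any code that instead gathers parameters
# by walking modules/names manually must dedupe the same way, or it will
# hand the optimizer the same tensor twice
print(len(list(model.parameters())))  # → 1
print(model.lm_head.weight is model.wte.weight)  # → True
```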
ok i merged to master for now because things don't seem broken. would like to investigate the new warning produced separately to make sure everything is ok. Closing this PR...
Ty I've been meaning to look into adding this, will review shortly.
I changed things around a bit wrt semantics and implementation and ended up here: https://github.com/karpathy/nanoGPT/commit/cf9991488629b1b072c49bf261d04b0c8a3207a3 any thoughts? ty for opening the PR.
ok! just for the record, some rough timings I saw:

- simple 1GPU training: 350ms
- DDP 4GPU training: 520ms (quite a bit of overhead from DDP, ~1.5X? :(...
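(Reading those numbers as seconds per optimizer step at fixed per-GPU batch size, which is my assumption, the per-step slowdown and the effective throughput gain work out to:)

```python
single, ddp = 0.350, 0.520  # step times from the timings above, in seconds
print(f"{ddp / single:.2f}x per-step overhead")       # ~1.49x slower per step
# 4 GPUs each doing a step in 0.520s vs 1 GPU in 0.350s
print(f"{4 * single / ddp:.2f}x effective speedup")   # ~2.69x throughput
```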
No. Not as is. There are two major stages to training these: the pretraining stage and the finetuning stage. This code does the former. The finetuning stage requires additional custom...
Just to clarify, we're talking about GPT-2 124M here?