Andrej
I'm trying to take this path but it's just making things worse and more complicated :( E.g. now I can't do `--compile=False` because of argparse's opinions about boolean variables. Which I...
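(For the record, the argparse pitfall is that `type=bool` just calls `bool()` on the string, and any non-empty string is truthy, so `--compile=False` parses as `True`. A minimal sketch of the usual workaround, not the actual nanoGPT argument handling:)

```python
import argparse

def str2bool(v):
    # argparse with type=bool would compute bool("False"), which is True
    # because "False" is a non-empty string; parse the string explicitly
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "1"):
        return True
    if v.lower() in ("no", "false", "f", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {v!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--compile", type=str2bool, default=True)

print(parser.parse_args(["--compile=False"]).compile)  # → False
print(bool("False"))  # the naive type=bool behavior → True
```

(On Python 3.9+ `argparse.BooleanOptionalAction` is another option, though it changes the flag syntax to `--compile` / `--no-compile`.)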
Thank you for the PR, yes this is weight tying, a common technique https://paperswithcode.com/method/weight-tying . It reduces the number of parameters, which is probably also very helpful for distributed training...
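(To put a number on the savings for GPT-2 124M: the token embedding is vocab_size × n_embd = 50257 × 768, and without tying the output head would duplicate a matrix of that size. A quick back-of-envelope check:)

```python
vocab_size, n_embd = 50257, 768  # GPT-2 124M config
# an untied lm_head would add a second (vocab_size, n_embd) matrix;
# tying it to wte saves exactly this many parameters
tied_savings = vocab_size * n_embd
print(f"{tied_savings:,}")  # 38,597,376 parameters saved
```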
@vgoklani what is the test exactly? forward pass? It also breaks the current nanoGPT code, with an error in `configure_optimizers`
So I tried to add support for tied weights here https://github.com/karpathy/nanoGPT/commit/7c8288552b3673574e0649e031963b8e7e8d4981 , TLDR it's just one line

```python
self.lm_head.weight = self.transformer.wte.weight # https://paperswithcode.com/method/weight-tying
```

and then some trickery with configure_optimizers,...
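(In isolation the tying, and the reason param-collection code like `configure_optimizers` needs care, looks roughly like this minimal sketch; `TinyLM` is a made-up module, not the nanoGPT `GPT` class:)

```python
import torch.nn as nn

class TinyLM(nn.Module):
    # hypothetical minimal model demonstrating weight tying
    def __init__(self, vocab_size=100, n_embd=32):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        # the one-line weight tying: both modules now share one tensor
        self.lm_head.weight = self.wte.weight

    def forward(self, idx):
        return self.lm_head(self.wte(idx))

model = TinyLM()
# nn.Module.parameters() deduplicates shared tensors by identity, so the
# tied weight is yielded once; any code that instead gathers parameters
# by walking modules/names manually must dedupe the same way, or it will
# hand the optimizer the same tensor twice
print(len(list(model.parameters())))  # → 1
print(model.lm_head.weight is model.wte.weight)  # → True
```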
ok i merged to master for now because things don't seem broken. would like to investigate the new warning produced separately to make sure everything is ok. Closing this PR...
Ty I've been meaning to look into adding this, will review shortly.
I changed things around a bit wrt semantics and implementation and ended up here: https://github.com/karpathy/nanoGPT/commit/cf9991488629b1b072c49bf261d04b0c8a3207a3 any thoughts? ty for opening the PR.
ok! just for the record, some rough timings I saw:

- simple 1GPU training: 350ms
- DDP 4GPU training: 520ms (quite a bit of overhead from DDP, ~1.5X? :(...
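(Reading those numbers as seconds per optimizer step at fixed per-GPU batch size, which is my assumption, the per-step slowdown and the effective throughput gain work out to:)

```python
single, ddp = 0.350, 0.520  # step times from the timings above, in seconds
print(f"{ddp / single:.2f}x per-step overhead")       # ~1.49x slower per step
# 4 GPUs each doing a step in 0.520s vs 1 GPU in 0.350s
print(f"{4 * single / ddp:.2f}x effective speedup")   # ~2.69x throughput
```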
No. Not as is. There are two major stages to training these: the pretraining stage and the finetuning stage. This code does the former. The finetuning stage requires additional custom...
Just to clarify, we're talking about GPT-2 124M here?