modded-nanogpt
NanoGPT (124M) in 5 minutes
Changes to make the code run on RTX 4090 / 3090. Fixes https://github.com/KellerJordan/modded-nanogpt/issues/29. [Runs in 2 hours 3 minutes](https://gist.github.com/lapp0/ff6f10c3cd6d0aefb28a49681a44b78c); final losses range from 3.275 to 3.285, and this run finished at 3.2817. These...
The current implementation converts trigonometric values (cos_cached and sin_cached) to bfloat16, which introduces significant precision issues. This degrades the relative positional encoding properties of RoPE, particularly in **long-context** scenarios, as...
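The scale of the precision loss can be illustrated in isolation by truncating a float32 cos cache to bfloat16's 7-bit mantissa. This is a sketch, not the repo's code: `head_dim`, `base`, the context length, and the truncation helper are all illustrative assumptions (true bfloat16 rounds to nearest, whereas this helper truncates, which gives a close upper bound on the error).

```python
import numpy as np

def to_bfloat16(x: np.ndarray) -> np.ndarray:
    # Simulate bfloat16 by zeroing the low 16 bits of float32,
    # leaving the 7-bit bfloat16 mantissa (truncation, not rounding).
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# RoPE-style frequency table for a hypothetical head_dim=64, base=10000.
head_dim, base = 64, 10000.0
inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
positions = np.arange(8192)[:, None]          # long-context positions
angles = positions * inv_freq[None, :]

cos_f32 = np.cos(angles).astype(np.float32)   # float32 reference cache
cos_bf16 = to_bfloat16(cos_f32)               # degraded bfloat16 cache

# Worst-case absolute error of the bfloat16 cache vs. float32,
# on the order of 2^-8 ~ 4e-3 for cos values near 1.
err = np.abs(cos_f32 - cos_bf16).max()
print(f"max |cos error| with bfloat16 cache: {err:.2e}")
```

Since attention scores accumulate many rotated dot products, errors of this magnitude in the cached tables can plausibly compound at long context lengths.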
https://arxiv.org/abs/2411.16085 claims a large improvement in optimization speed. I'm wondering whether it could help the speedrun (maybe it applies to the current optimizer, or maybe it would require switching to another one?).
Hi, thanks for the great repo! I would appreciate a speedrun on consumer cards, e.g. the RTX 4090. Since the model is 125M params, the RTX 4090's 24GB...
How to do inference?
This doesn't appear to significantly improve the loss, but it does speed up training by ~1% (on 1xH100) by splitting the orthonormalization task into `n_head` sub-tasks. Not sure if that's...
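The split can be sketched in isolation. Below, an SVD-based orthonormalization stands in for the Newton-Schulz iteration the optimizer actually uses, and the shapes (`n_head`, `head_dim`, `model_dim`) are illustrative assumptions rather than the repo's values:

```python
import numpy as np

def orthonormalize(M: np.ndarray) -> np.ndarray:
    # Nearest semi-orthogonal matrix via SVD (U @ Vt). A stand-in for
    # the Newton-Schulz iteration used in the actual optimizer.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

# Hypothetical attention projection: n_head heads of head_dim rows each,
# stacked into one (n_head * head_dim, model_dim) weight matrix.
n_head, head_dim, model_dim = 6, 64, 384
rng = np.random.default_rng(0)
W = rng.standard_normal((n_head * head_dim, model_dim))

# Per-head variant: split the rows into n_head blocks and orthonormalize
# each block independently -- n_head smaller problems that can run
# concurrently instead of one large one.
blocks = W.reshape(n_head, head_dim, model_dim)
W_per_head = np.concatenate([orthonormalize(b) for b in blocks], axis=0)

# Each head's block is now row-orthonormal on its own.
for b in W_per_head.reshape(n_head, head_dim, model_dim):
    assert np.allclose(b @ b.T, np.eye(head_dim), atol=1e-6)
```

The speedup would come from the smaller per-block problem size, at the cost of no longer orthonormalizing across head boundaries.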
## ChangeLog
* **Added UNet connectivity structure on the value embeddings**. This allowed us to reduce the number of value embeddings from 12 to 6 and the total...
Replace the fixed [12](https://github.com/KellerJordan/modded-nanogpt/blob/973030408364f8738b4ad9e8f912d8cbbf56e4d4/train_gpt2.py#L246) and [12](https://github.com/KellerJordan/modded-nanogpt/blob/973030408364f8738b4ad9e8f912d8cbbf56e4d4/train_gpt2.py#L268) with `n_layer` from the config.
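A minimal sketch of the requested change, where `GPTConfig` and `build_layer_indices` are hypothetical stand-ins for the real config and the layer-count-dependent code in `train_gpt2.py`:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Hypothetical stand-in for the config in train_gpt2.py.
    n_layer: int = 12
    n_head: int = 6
    n_embd: int = 768

def build_layer_indices(cfg: GPTConfig) -> list[int]:
    # Derive anything layer-count-dependent from cfg.n_layer
    # instead of a hard-coded 12, so non-default depths work.
    return list(range(cfg.n_layer))

cfg = GPTConfig(n_layer=16)
assert len(build_layer_indices(cfg)) == cfg.n_layer
```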
This seems advisable because the dataloader uses numpy extensively and oddities may crop up later with numpy updates.