NanoGPT (124M) in 5 minutes

Results 44 modded-nanogpt issues

Changes to make the code run on RTX 4090 / 3090. Fixes https://github.com/KellerJordan/modded-nanogpt/issues/29. [Runs in 2 hours 3 minutes](https://gist.github.com/lapp0/ff6f10c3cd6d0aefb28a49681a44b78c). Runs range from 3.275 to 3.285; this one finished at 3.2817. These...

The current implementation converts trigonometric values (cos_cached and sin_cached) to bfloat16, which introduces significant precision issues. This degrades the relative positional encoding properties of RoPE, particularly in **long-context** scenarios, as...
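To make the precision concern above concrete, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not the repository's code: `to_bfloat16`, `rope_score`, and `spread` are hypothetical helpers, bfloat16 is emulated by rounding a float32 bit pattern to its top 16 bits, and a single 2D rotation with frequency `theta = 1.0` stands in for one RoPE channel pair. With exact cos/sin, the score between two rotated vectors depends only on their offset; with bfloat16-rounded cos/sin, it drifts with absolute position. The sketch does not reproduce the long-context measurement from the issue.

```python
import math
import struct


def to_bfloat16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (illustrative emulation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of a float32; round to nearest, ties to even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rope_score(m: int, n: int, theta: float, quantize: bool) -> float:
    """Dot product of the unit vector (1, 0) rotated to positions m and n."""
    values = [math.cos(m * theta), math.sin(m * theta),
              math.cos(n * theta), math.sin(n * theta)]
    if quantize:
        # Assumption: only the cached cos/sin tables are stored in bfloat16.
        values = [to_bfloat16(v) for v in values]
    cos_m, sin_m, cos_n, sin_n = values
    # Exact arithmetic gives cos((m - n) * theta), independent of m.
    return cos_m * cos_n + sin_m * sin_n


def spread(offset: int, theta: float, positions, quantize: bool) -> float:
    """How much the score for a fixed offset varies across absolute positions."""
    scores = [rope_score(m, m + offset, theta, quantize) for m in positions]
    return max(scores) - min(scores)


positions = range(2048)
full = spread(5, 1.0, positions, quantize=False)
bf16 = spread(5, 1.0, positions, quantize=True)
print(f"float64 spread: {full:.2e}, bfloat16 spread: {bf16:.2e}")
```

A nonzero spread in the quantized case means the attention score for the same relative offset is no longer the same at every position, which is the property RoPE is supposed to guarantee.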

https://arxiv.org/abs/2411.16085 claims to improve optimization speed a lot. I'm wondering whether this would help the speedrun (maybe it applies to the current optimizer, or maybe we would need to switch to another one?).

Hi, thanks for the great repo! I would appreciate a speedrun on consumer cards, e.g. the RTX 4090. Since it is 125M params, the RTX4090's 24GB...

how to do inference?

This doesn't appear to significantly improve the loss, but it does speed up training by ~1% (on 1xH100), by splitting the orthonormalization task into `n_head` sub-tasks. Not sure if that's...
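As a rough picture of what splitting the orthonormalization into `n_head` sub-tasks could look like, the sketch below orthogonalizes each head's block of rows separately instead of the whole matrix at once. Everything here is an assumption for illustration: the row-grouped-by-head layout, the helper names (`matmul`, `transpose`, `orthogonalize`, `orthogonalize_per_head`), the toy matrix `W`, and the use of a plain cubic Newton-Schulz iteration rather than whatever iteration the repository actually runs. It uses pure Python lists so it runs without any dependencies.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def orthogonalize(x, steps=20):
    """Cubic Newton-Schulz: drives the rows of x toward an orthonormal set.

    Assumes x has full row rank and no more rows than columns.
    """
    norm = math.sqrt(sum(v * v for row in x for v in row))
    x = [[v / norm for v in row] for row in x]  # singular values now <= 1
    for _ in range(steps):
        gram_x = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 * X - 0.5 * (X X^T) X
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_g)]
             for row_x, row_g in zip(x, gram_x)]
    return x


def orthogonalize_per_head(w, n_head):
    """Split the rows of w into n_head equal blocks and orthogonalize each alone."""
    head_dim = len(w) // n_head
    out = []
    for h in range(n_head):
        out.extend(orthogonalize(w[h * head_dim:(h + 1) * head_dim]))
    return out


# Toy 4x4 matrix treated as 2 heads of 2 rows each (hypothetical layout).
W = [
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0, -1.0],
    [2.0, -1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 4.0],
]
result = orthogonalize_per_head(W, n_head=2)
```

Each block of `result` has orthonormal rows, but rows from different heads are generally not orthogonal to each other, so the per-head result differs from orthogonalizing the full matrix.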

![image](https://github.com/user-attachments/assets/2576a90e-f47a-4ff1-8103-30a7952cb077) ![image](https://github.com/user-attachments/assets/5250b642-5c76-42aa-ab95-bfb415a1d9a6)

## ChangeLog

* **Added UNet connectivity structure on the value embeddings**. This allowed us to reduce the number of value embeddings from 12 to 6 and the total...

Replace the hard-coded [12](https://github.com/KellerJordan/modded-nanogpt/blob/973030408364f8738b4ad9e8f912d8cbbf56e4d4/train_gpt2.py#L246) and [12](https://github.com/KellerJordan/modded-nanogpt/blob/973030408364f8738b4ad9e8f912d8cbbf56e4d4/train_gpt2.py#L268) with `n_layer` from the config.

This seems advisable because the dataloader uses numpy extensively and oddities may crop up later with numpy updates.