modded-nanogpt
NanoGPT (124M) in 5 minutes
Changes to make the code run on RTX 4090 / 3090. Fixes https://github.com/KellerJordan/modded-nanogpt/issues/29. [Runs in 2 hours 3 minutes](https://gist.github.com/lapp0/ff6f10c3cd6d0aefb28a49681a44b78c); final losses range from 3.275 to 3.285, and this run finished at 3.2817. These...
The current implementation converts trigonometric values (cos_cached and sin_cached) to bfloat16, which introduces significant precision issues. This degrades the relative positional encoding properties of RoPE, particularly in **long-context** scenarios, as...
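The scale of the precision loss can be illustrated in isolation by truncating a float32 cos cache to bfloat16's 7-bit mantissa. This is a sketch, not the repo's code: `head_dim`, `base`, the context length, and the truncation helper are all illustrative assumptions (true bfloat16 rounds to nearest, whereas this helper truncates, which gives a close upper bound on the error).

```python
import numpy as np

def to_bfloat16(x: np.ndarray) -> np.ndarray:
    # Simulate bfloat16 by zeroing the low 16 bits of float32,
    # leaving the 7-bit bfloat16 mantissa (truncation, not rounding).
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# RoPE-style frequency table for a hypothetical head_dim=64, base=10000.
head_dim, base = 64, 10000.0
inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
positions = np.arange(8192)[:, None]          # long-context positions
angles = positions * inv_freq[None, :]

cos_f32 = np.cos(angles).astype(np.float32)   # float32 reference cache
cos_bf16 = to_bfloat16(cos_f32)               # degraded bfloat16 cache

# Worst-case absolute error of the bfloat16 cache vs. float32,
# on the order of 2^-8 ~ 4e-3 for cos values near 1.
err = np.abs(cos_f32 - cos_bf16).max()
print(f"max |cos error| with bfloat16 cache: {err:.2e}")
```

Since attention scores accumulate many rotated dot products, errors of this magnitude in the cached tables can plausibly compound at long context lengths.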
https://arxiv.org/abs/2411.16085 claims a large improvement in optimization speed. I'm wondering whether it could help the speedrun (maybe it applies to the current optimizer, or maybe it would require switching to another one?).
Hi, thanks for the great repo! I would appreciate a speedrun on consumer cards, e.g. the RTX 4090. Since the model is 125M params, the RTX 4090's 24GB...
How to do inference?
This doesn't appear to significantly improve the loss, but it does speed up training by ~1% (on 1xH100) by splitting the orthonormalization task into `n_head` sub-tasks. Not sure if that's...
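The split can be sketched in isolation. Below, an SVD-based orthonormalization stands in for the Newton-Schulz iteration the optimizer actually uses, and the shapes (`n_head`, `head_dim`, `model_dim`) are illustrative assumptions rather than the repo's values:

```python
import numpy as np

def orthonormalize(M: np.ndarray) -> np.ndarray:
    # Nearest semi-orthogonal matrix via SVD (U @ Vt). A stand-in for
    # the Newton-Schulz iteration used in the actual optimizer.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

# Hypothetical attention projection: n_head heads of head_dim rows each,
# stacked into one (n_head * head_dim, model_dim) weight matrix.
n_head, head_dim, model_dim = 6, 64, 384
rng = np.random.default_rng(0)
W = rng.standard_normal((n_head * head_dim, model_dim))

# Per-head variant: split the rows into n_head blocks and orthonormalize
# each block independently -- n_head smaller problems that can run
# concurrently instead of one large one.
blocks = W.reshape(n_head, head_dim, model_dim)
W_per_head = np.concatenate([orthonormalize(b) for b in blocks], axis=0)

# Each head's block is now row-orthonormal on its own.
for b in W_per_head.reshape(n_head, head_dim, model_dim):
    assert np.allclose(b @ b.T, np.eye(head_dim), atol=1e-6)
```

The speedup would come from the smaller per-block problem size, at the cost of no longer orthonormalizing across head boundaries.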
## ChangeLog
* **Added UNet connectivity structure on the value embeddings**. This allowed us to reduce the number of value embeddings from 12 to 6 and the total...
Replace the fixed [12](https://github.com/KellerJordan/modded-nanogpt/blob/973030408364f8738b4ad9e8f912d8cbbf56e4d4/train_gpt2.py#L246) and [12](https://github.com/KellerJordan/modded-nanogpt/blob/973030408364f8738b4ad9e8f912d8cbbf56e4d4/train_gpt2.py#L268) with `n_layer` from the config.
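A minimal sketch of the requested change, where `GPTConfig` and `build_layer_indices` are hypothetical stand-ins for the real config and the layer-count-dependent code in `train_gpt2.py`:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Hypothetical stand-in for the config in train_gpt2.py.
    n_layer: int = 12
    n_head: int = 6
    n_embd: int = 768

def build_layer_indices(cfg: GPTConfig) -> list[int]:
    # Derive anything layer-count-dependent from cfg.n_layer
    # instead of a hard-coded 12, so non-default depths work.
    return list(range(cfg.n_layer))

cfg = GPTConfig(n_layer=16)
assert len(build_layer_indices(cfg)) == cfg.n_layer
```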
This seems advisable because the dataloader uses numpy extensively and oddities may crop up later with numpy updates.