# NanoPoor

NanoGPT-speedrunning for the poor T4 enjoyers

Colab Notebook

Inspired by Modded NanoGPT and my goat Jonas Geiping (Cramming), I trained a custom GPT I've been working on over at Dagonet and got to a 3.28 val loss on a single T4.

## Important! Note/Future-bugfix

As @main_horse pointed out, I wrote the DSMoE class so it sends the current token to all experts and then applies the router weights afterwards, which removes the router's hard selection and turns it into a soft weighting instead. Hard routing gets the loss ~0.1 lower (about 10 steps faster), but wall-clock time per step is 2x longer and init was 8x longer, so I'm working on GEMMs; a sketch of the hard-routing variant is below.
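
Not the actual Dagonet code, just a minimal sketch of what hard top-k dispatch looks like (the soft variant instead runs every expert on every token and mixes the outputs with the router scores); `n_experts`, `top_k`, and the expert widths here are placeholder values:

```python
import torch
import torch.nn as nn

class HardRoutedMoE(nn.Module):
    """Sketch of hard top-k MoE routing: each token is dispatched only
    to its top-k experts, instead of running all experts and
    soft-weighting their outputs."""
    def __init__(self, dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.view(-1, x.size(-1))                    # (N, dim)
        probs = self.router(tokens).softmax(dim=-1)        # (N, n_experts)
        weights, idx = probs.topk(self.top_k, dim=-1)      # hard selection
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over top-k
        out = torch.zeros_like(tokens)
        # Python-level loop over experts: correct but slow without grouped GEMMs
        for e, expert in enumerate(self.experts):
            hit = (idx == e)                               # (N, top_k) bool
            rows = hit.any(dim=-1).nonzero(as_tuple=True)[0]
            if rows.numel() == 0:
                continue
            w = weights[rows][hit[rows]].unsqueeze(-1)     # router weight per hit token
            out[rows] += w * expert(tokens[rows])
        return out.view_as(x)
```

That per-expert Python loop is exactly where the naive hard version loses wall-clock time: grouped GEMMs fuse those per-expert matmuls into one batched kernel.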

Caveats:

  • Fewer parameters than the 120M from the main speedrun, for stability
  • Data was just a 1B-token subset of finewebedu10b, not filtered or anything; I had only processed that much at the time, will probably fix this later

## Runs

| Ranking | Time | Date | Data | Person | Description | Log |
|---|---|---|---|---|---|---|
| 1 | 7.09m | 4/5/25 | ~3.27M tok (1024 * 8 * 4 * 100) | Vatsa | GPT-2 tokenizer with shrunk vocab_size; also shrunk lm_head and n_experts for stability; fewer params, now at ~73M | log |
| 2 | 11.69m | 4/4/25 | ~3.93M tok (1024 * 8 * 4 * 120) | Vatsa | lr tuning (5e-4) | log |
| 3 | 14.86m | 4/2/25 | ~5.21M tok (1024 * 8 * 4 * 160) | Vatsa | 3x lr, removed ckpt saves every step, less printing | log |
| 4 | 15.04m | 4/1/25 | ~3.89M tok (1024 * 5 * 4 * 190) | Vatsa | Used Muon instead | log |
| 5 | 37.17m | 4/1/25 | ~6.14M tok (1024 * 5 * 4 * 300) | Vatsa | Added PSGD | log |
| 6 | 70.61m | 3/31/25 | ~14M tok (1024 * 6 * 4 * 570) | Vatsa | First run; has DS-MoE, MLA+NSA hybrid, RoPE, etc. | log |
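
The Data column appears to multiply out as seq_len * micro-batch * grad-accum * steps; e.g. for the top run:

```
1024 * 8 * 4 * 100 = 3,276,800 ≈ 3.27M tokens
```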

## Unofficial Runs

| Ranking | Time | Date | Data | Person | Description | Log |
|---|---|---|---|---|---|---|
| 1 | 7.63m | 4/1/25 | ~6.96M tok (1024 * 10 * 4 * 170) | Vatsa | Used an A100 with the (15.04m - 4/1/25) run to see how it looks on a real GPU | log |