# nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Results: 297 nanoGPT issues

If I understand correctly, you have at most 600,000 iterations times batches of 12, which is roughly 7M training examples fed to the transformer, far smaller than the 9B tokens of...
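A back-of-envelope check of the figures in this snippet (the 600,000 iterations and batch size of 12 come from the comment; the block size of 1024 is an assumption based on nanoGPT's default, and gradient accumulation is ignored):

```python
def examples_seen(max_iters, batch_size):
    """Total training examples (sequences) fed to the model."""
    return max_iters * batch_size

def tokens_seen(max_iters, batch_size, block_size):
    """Total tokens seen, assuming every sequence is block_size tokens long."""
    return max_iters * batch_size * block_size

# Figures from the comment: 600,000 iterations, batches of 12.
n = examples_seen(600_000, 12)            # 7,200,000 sequences, i.e. roughly 7M
# With nanoGPT's default block_size of 1024 (an assumption here), that is
# about 7.4B tokens -- the same ballpark as the 9B mentioned in the issue.
t = tokens_seen(600_000, 12, 1024)
```

So whether the run looks small depends on whether you count sequences or tokens.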

Hi, I love this project as a way to learn from scratch with local development. I was able to finetune the model, generate the checkpoints, and generate the samples. Is there an...

Hello, I have an issue while loading my dataset in prepare.py (for openwebtext). The download and the extraction complete successfully, but the generation of the train split raises an error. I've already...

This PR is a mostly failed attempt to fix [issue #95](https://github.com/karpathy/minGPT/issues/95) from the minGPT repo. The idea is to cache the results of the key and value projections in each self-attention...
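For context, the idea in that PR — caching key/value projections so past tokens are not re-projected on every generation step — can be sketched like this (names and values are illustrative placeholders, not the PR's actual code):

```python
class KVCache:
    """Toy per-layer cache of key/value projections for autoregressive decoding.

    Instead of recomputing k and v for the whole prefix at every step, we
    append only the newest token's projections and reuse everything cached.
    """
    def __init__(self):
        self.keys = []    # one entry per past token
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def full(self):
        # Attention for the new token then runs against all cached k/v pairs.
        return self.keys, self.values

# Decode three tokens: each step projects only the newest token.
cache = KVCache()
for step in range(3):
    # Stand-ins for the real key/value projections of the new token.
    cache.append(("k", step), ("v", step))
keys, values = cache.full()
```

The payoff is that per-step cost becomes linear in the prefix length (the attention itself) rather than re-running the projections over the whole prefix.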

Try on a cluster using multiple nodes. Example: 1) run `torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr= --master_port=1234 train.py --dataset=shakespeare --dtype=float16 --batch_size=2 --compile=False`. Got errors: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to...
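For what it's worth, a quick way to confirm what that launch implies on each node: torchrun exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` into each process's environment, and with `--nproc_per_node=2 --nnodes=2` the global world size should be 4. A minimal sketch reading those standard variables, with single-process fallbacks:

```python
import os

def ddp_config():
    """Read the rank/world-size variables that torchrun sets per process."""
    return {
        "rank": int(os.environ.get("RANK", 0)),              # global rank
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # rank within the node
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # total process count
    }

# With --nproc_per_node=2 --nnodes=2, torchrun launches 4 processes total.
cfg = ddp_config()
```

Note that `--batch_size` in nanoGPT is per process, so the OOM above is about a single GPU's memory; adding nodes doesn't shrink it — lowering `batch_size` or `block_size` does.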

I noticed the comment that you're using torch 2.0 and that if you encounter warnings you should set `--compile=False`. The problem I'm running into is that flash is auto-detected: # flash attention make GPU...
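The auto-detection being referenced is a feature probe: nanoGPT checks whether `torch.nn.functional` exposes `scaled_dot_product_attention` (the fused/flash path added in PyTorch 2.0) and falls back to a manual attention implementation otherwise. A minimal sketch of that pattern, with the module passed in so the check is explicit:

```python
def supports_flash(functional_module):
    """True when the fused scaled-dot-product-attention entry point exists.

    In practice this would be called as supports_flash(torch.nn.functional);
    the attribute was added in torch 2.0, so older installs return False.
    """
    return hasattr(functional_module, "scaled_dot_product_attention")
```

Because this is an attribute check rather than a config flag, installs older than 2.0 silently take the slow path, which is why the warning in the issue appears regardless of `--compile`.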

Implements torch SDPA for mem_efficient kernel support! Using the mem_efficient kernel results in ~15.5% faster training time per batch, going from a ~154 ms/batch baseline to ~130 ms/batch. (Ran on 8...

Like the title says, I have a $300 AWS credit, and I'm curious if any of you have worked out the cost to train the XL model. If anyone is interested...
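One rough way to frame that question, with all numbers hypothetical (the hourly GPU price and the GPU-hours a training run needs vary widely by instance type and configuration):

```python
def training_cost(gpu_hours, price_per_gpu_hour):
    """Back-of-envelope cloud cost: GPU-hours times the hourly price."""
    return gpu_hours * price_per_gpu_hour

def affordable_gpu_hours(budget, price_per_gpu_hour):
    """How many GPU-hours a fixed credit buys at a given hourly price."""
    return budget / price_per_gpu_hour

# Hypothetical example: at $3/GPU-hour, a $300 credit buys 100 GPU-hours.
hours = affordable_gpu_hours(300, 3.0)
```

Comparing that GPU-hour budget against a measured tokens-per-second throughput for the XL config would then give an estimate of how far the credit actually goes.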