nanoGPT
Difference In Char-Shakespeare Training Time on A100 With PyTorch 1.16 vs. Nightly
Hi there, I just wanted to confirm the observed differences in training time and performance when using an A100 (as Karpathy did) with PyTorch 1.16 vs. the nightly build (slow attention and no compile). The speed difference (>8x) seems quite large. However, I don't normally use PyTorch (I mostly use TF), so I'm not certain whether I'm doing something wrong or whether this is expected.
The details are as follows:
Karpathy's benchmarks:
- ~3 minutes for 5000 iterations (36 milliseconds per iteration)
- best validation loss is 1.4697
- Sampled text:
ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.
DUKE VINCENTIO:
I thank your eyes against it.
DUKE VINCENTIO:
Then will answer him to save the malm:
And what have you tyrannous shall do this?
DUKE VINCENTIO:
If you have done evils of all disposition
To end his power, the day of thrust for a common men
That I leave, to fight with over-liking
Hasting in a roseman.
My benchmarks:
- ~30 minutes for 5000 iterations (250-350 milliseconds per iteration – 8-10x slower)
- best validation loss is 1.4820 (at step 500!)
- Sampled text:
YORK:
I trust with the bastard and thy beauteous graves
Beat taunt'st by his world vessary:
And affright, my father, when my knows I to thy heart
Hath discover'd thee by mine at Henry;
Give thee the king's eyes of life dispatch'd her death.
DUKE OF AUMERLE:
Here comes my father shall be wont
Because an hour brother king at thy right.
KING RICHARD II:
So such a dangerous for a comfort which I may beat
To fight with our royal banishment
Take day our souls to distraught it.
Hi @darien-schettler,
I am running on a 1x 4090 GPU using the PyTorch 2.0 nightly build. Running python train.py config/train_shakespeare_char.py
with default parameters:
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often
always_save_checkpoint = False
wandb_log = False # override via command line if you like
wandb_project = 'shakespeare-char'
wandb_run_name = 'mini-gpt'
dataset = 'shakespeare_char'
batch_size = 64
block_size = 256 # context of up to 256 previous characters
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2
learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small
warmup_iters = 100 # not super necessary potentially
Here is the training summary along with logs on 1x 4090 GPU:
- 800-900 ms per iteration: in total, 5000 x 900 ms / (1000 x 60) = 75 minutes of training time
- I am also getting the same validation loss as yours at step 500.
iter 0: loss 4.2648, time 39050.92ms, mfu -100.00% iter 10: loss 3.2203, time 873.04ms, mfu 17.07% iter 20: loss 2.7697, time 873.45ms, mfu 17.07% iter 30: loss 2.6103, time 872.74ms, mfu 17.07% iter 40: loss 2.5341, time 873.17ms, mfu 17.07% iter 50: loss 2.5006, time 872.40ms, mfu 17.07% iter 60: loss 2.4812, time 874.22ms, mfu 17.07% iter 70: loss 2.4785, time 872.38ms, mfu 17.07% iter 80: loss 2.4377, time 873.88ms, mfu 17.07% iter 90: loss 2.4303, time 873.09ms, mfu 17.07% iter 100: loss 2.3901, time 874.36ms, mfu 17.07% iter 110: loss 2.3902, time 873.71ms, mfu 17.07% iter 120: loss 2.3301, time 904.70ms, mfu 17.01% iter 130: loss 2.2632, time 894.35ms, mfu 16.97% iter 140: loss 2.2051, time 882.74ms, mfu 16.97% iter 150: loss 2.1421, time 884.48ms, mfu 16.95% iter 160: loss 2.0592, time 882.61ms, mfu 16.95% iter 170: loss 1.9929, time 875.93ms, mfu 16.95% iter 180: loss 1.9397, time 872.51ms, mfu 16.97% iter 190: loss 1.8576, time 872.87ms, mfu 16.98% iter 200: loss 1.8480, time 873.47ms, mfu 16.99% iter 210: loss 1.7974, time 873.83ms, mfu 16.99% iter 220: loss 1.7305, time 872.38ms, mfu 17.00% iter 230: loss 1.7193, time 874.67ms, mfu 17.01% iter 240: loss 1.6566, time 873.58ms, mfu 17.01% step 250: train loss 1.5547, val loss 1.7287 saving checkpoint to out-shakespeare-char-test iter 250: loss 1.6339, time 3529.40ms, mfu 15.73% iter 260: loss 1.6051, time 876.64ms, mfu 15.86% iter 270: loss 1.5910, time 873.80ms, mfu 15.98% iter 280: loss 1.5451, time 874.02ms, mfu 16.09% iter 290: loss 1.5214, time 874.05ms, mfu 16.18% iter 300: loss 1.5130, time 874.74ms, mfu 16.27% iter 310: loss 1.4959, time 873.85ms, mfu 16.35% iter 320: loss 1.4865, time 878.16ms, mfu 16.41% iter 330: loss 1.4860, time 874.85ms, mfu 16.47% iter 340: loss 1.4471, time 873.93ms, mfu 16.53% iter 350: loss 1.4191, time 873.80ms, mfu 16.58% iter 360: loss 1.3754, time 876.43ms, mfu 16.63% iter 370: loss 1.3651, time 874.00ms, mfu 16.67% iter 380: loss 1.3434, time 877.09ms, mfu 16.70% iter 390: loss 1.3180, time 874.38ms, mfu 16.74% iter 400: loss 1.2965, time 875.17ms, mfu 16.77% iter 410: loss 1.3331, time 874.72ms, mfu 16.79% iter 420: loss 1.2910, time 876.10ms, mfu 16.81% iter 430: loss 1.2719, time 873.95ms, mfu 16.84% iter 440: loss 1.2871, time 874.31ms, mfu 16.86% iter 450: loss 1.2662, time 891.47ms, mfu 16.85% iter 460: loss 1.2239, time 888.73ms, mfu 16.84% iter 470: loss 1.2539, time 875.25ms, mfu 16.86% iter 480: loss 1.2353, time 873.99ms, mfu 16.88% iter 490: loss 1.2078, time 874.07ms, mfu 16.89% step 500: train loss 1.0880, val loss 1.4778 saving checkpoint to out-shakespeare-char-test
@otaviogood would you be kind enough to also post your training logs for train_shakespeare_char.py, since you have a 1x 4090: Ref. Trying to figure out what is going wrong.
It took me ~8 minutes to train to this point...
iter 0: loss 4.2648, time 9371.81ms, mfu -100.00% iter 10: loss 3.2202, time 893.76ms, mfu 16.68% iter 20: loss 2.7697, time 894.88ms, mfu 16.67% iter 30: loss 2.6104, time 893.50ms, mfu 16.68% iter 40: loss 2.5339, time 893.90ms, mfu 16.68% iter 50: loss 2.5026, time 894.09ms, mfu 16.67% iter 60: loss 2.4706, time 895.86ms, mfu 16.67% iter 70: loss 2.4851, time 894.67ms, mfu 16.67% iter 80: loss 2.4363, time 904.07ms, mfu 16.65% iter 90: loss 2.4392, time 893.38ms, mfu 16.65% iter 100: loss 2.4049, time 929.71ms, mfu 16.59% iter 110: loss 2.3950, time 892.77ms, mfu 16.60% iter 120: loss 2.3203, time 913.16ms, mfu 16.57% iter 130: loss 2.2558, time 894.64ms, mfu 16.58% iter 140: loss 2.2287, time 893.37ms, mfu 16.59% iter 150: loss 2.1278, time 895.55ms, mfu 16.60% iter 160: loss 2.0495, time 922.19ms, mfu 16.55% iter 170: loss 1.9908, time 893.89ms, mfu 16.57% iter 180: loss 1.9392, time 894.75ms, mfu 16.58% iter 190: loss 1.8510, time 894.19ms, mfu 16.59% iter 200: loss 1.8447, time 895.36ms, mfu 16.59% iter 210: loss 1.7943, time 893.15ms, mfu 16.60% iter 220: loss 1.7208, time 892.75ms, mfu 16.61% iter 230: loss 1.7159, time 894.08ms, mfu 16.62% iter 240: loss 1.6524, time 895.31ms, mfu 16.62% step 250: train loss 1.5440, val loss 1.7220 saving checkpoint to out-shakespeare-char iter 250: loss 1.6167, time 3694.46ms, mfu 15.36% iter 260: loss 1.6010, time 896.53ms, mfu 15.49% iter 270: loss 1.5830, time 895.19ms, mfu 15.60% iter 280: loss 1.5345, time 921.24ms, mfu 15.66% iter 290: loss 1.5143, time 898.22ms, mfu 15.75% iter 300: loss 1.5020, time 894.73ms, mfu 15.84% iter 310: loss 1.4904, time 895.30ms, mfu 15.93% iter 320: loss 1.4747, time 893.31ms, mfu 16.00% iter 330: loss 1.4784, time 952.10ms, mfu 15.97% iter 340: loss 1.4382, time 931.42ms, mfu 15.97% iter 350: loss 1.4192, time 896.70ms, mfu 16.04% iter 360: loss 1.3798, time 896.62ms, mfu 16.09% iter 370: loss 1.3605, time 894.52ms, mfu 16.15% iter 380: loss 1.3365, time 895.34ms, mfu 16.20% iter 390: loss 1.3092, time 895.64ms, mfu 16.24% iter 400: loss 1.2784, time 894.07ms, mfu 16.29% iter 410: loss 1.3261, time 894.70ms, mfu 16.32% iter 420: loss 1.2880, time 895.75ms, mfu 16.36% iter 430: loss 1.2595, time 894.65ms, mfu 16.39% iter 440: loss 1.2874, time 895.18ms, mfu 16.41% iter 450: loss 1.2703, time 893.15ms, mfu 16.44% iter 460: loss 1.2223, time 895.34ms, mfu 16.46% iter 470: loss 1.2556, time 895.08ms, mfu 16.48% iter 480: loss 1.2311, time 894.67ms, mfu 16.50% iter 490: loss 1.2055, time 922.89ms, mfu 16.46% step 500: train loss 1.0904, val loss 1.4732
My numbers line up with yours but it looks like there are maybe minor floating point differences. I might be running a weird Triton (compiler) setup that causes that. So yeah, 8 minutes is worse than Andrej's 3 minutes...
I spun up an A100 on Lambda cloud, and this is what I got. It took about 5.5 minutes this time to run it this far, so the A100 is faster than my 4090... IDK why. But it's still not at Andrej's 3-minute time. Both of these are warning me that I'm not using flash attention. So maybe that?
iter 0: loss 4.2648, time 24493.23ms, mfu -100.00% iter 10: loss 3.2202, time 726.06ms, mfu 20.53% iter 20: loss 2.7697, time 725.95ms, mfu 20.53% iter 30: loss 2.6131, time 725.95ms, mfu 20.53% iter 40: loss 2.5413, time 726.42ms, mfu 20.53% iter 50: loss 2.5078, time 725.76ms, mfu 20.53% iter 60: loss 2.4698, time 725.56ms, mfu 20.53% iter 70: loss 2.4770, time 726.00ms, mfu 20.53% iter 80: loss 2.4388, time 728.99ms, mfu 20.52% iter 90: loss 2.4365, time 729.77ms, mfu 20.51% iter 100: loss 2.4013, time 730.23ms, mfu 20.50% iter 110: loss 2.3887, time 730.48ms, mfu 20.49% iter 120: loss 2.3284, time 730.33ms, mfu 20.48% iter 130: loss 2.2556, time 730.61ms, mfu 20.48% iter 140: loss 2.2216, time 731.40ms, mfu 20.47% iter 150: loss 2.1398, time 731.70ms, mfu 20.46% iter 160: loss 2.0580, time 731.72ms, mfu 20.45% iter 170: loss 1.9998, time 731.52ms, mfu 20.44% iter 180: loss 1.9351, time 732.22ms, mfu 20.43% iter 190: loss 1.8588, time 731.72ms, mfu 20.43% iter 200: loss 1.8503, time 731.54ms, mfu 20.42% iter 210: loss 1.7917, time 731.59ms, mfu 20.42% iter 220: loss 1.7247, time 731.96ms, mfu 20.41% iter 230: loss 1.7166, time 731.09ms, mfu 20.41% iter 240: loss 1.6551, time 732.31ms, mfu 20.40% step 250: train loss 1.5489, val loss 1.7311 saving checkpoint to out-shakespeare-char iter 250: loss 1.6182, time 3669.69ms, mfu 18.77% iter 260: loss 1.6092, time 731.56ms, mfu 18.93% iter 270: loss 1.5958, time 731.69ms, mfu 19.07% iter 280: loss 1.5405, time 731.67ms, mfu 19.20% iter 290: loss 1.5239, time 730.94ms, mfu 19.32% iter 300: loss 1.5078, time 732.09ms, mfu 19.43% iter 310: loss 1.4970, time 731.48ms, mfu 19.52% iter 320: loss 1.4900, time 732.09ms, mfu 19.60% iter 330: loss 1.4831, time 731.86ms, mfu 19.68% iter 340: loss 1.4370, time 731.61ms, mfu 19.75% iter 350: loss 1.4244, time 732.16ms, mfu 19.81% iter 360: loss 1.3776, time 731.57ms, mfu 19.87% iter 370: loss 1.3631, time 731.09ms, mfu 19.92% iter 380: loss 1.3416, time 732.23ms, mfu 19.96% iter 390: loss 1.3183, time 732.34ms, mfu 20.00% iter 400: loss 1.2901, time 732.21ms, mfu 20.04% iter 410: loss 1.3268, time 731.78ms, mfu 20.07% iter 420: loss 1.2918, time 731.93ms, mfu 20.10% iter 430: loss 1.2656, time 731.92ms, mfu 20.13% iter 440: loss 1.2857, time 731.84ms, mfu 20.15% iter 450: loss 1.2583, time 731.10ms, mfu 20.17% iter 460: loss 1.2156, time 731.98ms, mfu 20.19% iter 470: loss 1.2509, time 731.43ms, mfu 20.21% iter 480: loss 1.2320, time 730.58ms, mfu 20.23% iter 490: loss 1.2147, time 732.16ms, mfu 20.24% step 500: train loss 1.0905, val loss 1.4711 saving checkpoint to out-shakespeare-char
Hm, good to know that the difference between the nightly and 1.16 PyTorch really is that intense. Can you try rerunning your Lambda cloud run with compile set to False?
That would confirm it for sure.
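If I understand nanoGPT's command-line config overrides correctly, that should just be:

    python train.py config/train_shakespeare_char.py --compile=False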
@otaviogood is your 5.5 minutes in the above logs measured on 1x A100 or 8x A100?
Maybe I misunderstood, but Andrej's 3-minute training time was on 8x A100 for 5000 iterations, so we are still far behind what he got; correct me if I am wrong.
From the Readme:
On one A100 GPU this training run takes about 3 minutes and the best validation loss is 1.4697.
Sorry for the typo, I meant 1x A100. My primary question was about the number of iterations, since it is not mentioned in the Readme.
@otaviogood I was using flash attention before and now if I turn it off, I don't see any drop in runtime; not sure why there is no change:
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0 WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0 WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0 WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0 WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0 WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0 number of parameters: 10.65M using fused AdamW: True compiling the model... (takes a ~minute) step 0: train loss 4.2874, val loss 4.2823 iter 0: loss 4.2648, time 9506.15ms, mfu -100.00% iter 10: loss 3.2203, time 870.50ms, mfu 17.12% iter 20: loss 2.7697, time 869.85ms, mfu 17.12% iter 30: loss 2.6135, time 872.83ms, mfu 17.12% iter 40: loss 2.5417, time 880.42ms, mfu 17.10% iter 50: loss 2.5096, time 887.50ms, mfu 17.07% iter 60: loss 2.4740, time 871.22ms, mfu 17.07% iter 70: loss 2.4796, time 871.56ms, mfu 17.08% iter 80: loss 2.4337, time 871.96ms, mfu 17.08% iter 90: loss 2.4338, time 878.79ms, mfu 17.07% iter 100: loss 2.4038, time 876.84ms, mfu 17.06% iter 110: loss 2.3857, time 873.60ms, mfu 17.06% iter 120: loss 2.3132, time 885.35ms, mfu 17.04% iter 130: loss 2.2729, time 873.50ms, mfu 17.04% iter 140: loss 2.2230, time 875.74ms, mfu 17.04% iter 150: loss 2.1326, time 873.85ms, mfu 17.04% iter 160: loss 2.0576, time 877.37ms, mfu 17.03% iter 170: loss 2.0020, time 876.67ms, mfu 17.03% iter 180: loss 1.9470, time 911.25ms, mfu 16.96%
It would be good to check if you are seeing any improvement in runtime once you turn on flash attention.
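For context, my rough mental model of the difference between the two attention paths is below. This is only a sketch assuming PyTorch 2.x's torch.nn.functional.scaled_dot_product_attention is available, not nanoGPT's exact model.py code; the tensor names and shapes are illustrative.

    import torch
    import torch.nn.functional as F

    def causal_attention(q, k, v, dropout_p=0.0, use_flash=True):
        # q, k, v: (batch, n_head, seq_len, head_dim)
        if use_flash and hasattr(F, 'scaled_dot_product_attention'):
            # Fused (Flash Attention) kernel, available in PyTorch 2.x / nightly builds.
            return F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p, is_causal=True)
        # "Slow" attention: materializes the full (seq_len x seq_len) attention matrix.
        T = q.size(-2)
        att = (q @ k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
        att = att.masked_fill(~mask, float('-inf'))
        att = F.softmax(att, dim=-1)
        if dropout_p > 0.0:
            att = F.dropout(att, p=dropout_p)
        return att @ v

Also, the warning above says the fused path needs dropout=0.0 on that PyTorch build, and this config uses dropout = 0.2, so both runs may be falling back to the slow path either way; that would explain seeing no runtime change when toggling the flag.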
Ok, so I ran 2 more tests on the Lambda cloud 1xA100.
- I changed gradient_accumulation_steps = 1. That made the whole thing train to a val loss of 1.4856 in 1 min 40 seconds. *Disclaimer: my training is non-deterministic for some reason, so that's +/- 0.02.
- I reverted the entire change that I submitted and Andrej accepted. That made the whole thing train to a val loss of 1.4729 in approximately 1 minute 20 seconds.
So my CL basically screws over all the people training Shakespeare. :/ Sorry about that. It's just a problem of different hyperparams working for different things. My CL was tuned to reproduce Andrej's OpenWebText training run for the 124M-param model, and I tested it thoroughly for that; that setup was completely failing before for single-GPU runs.
So maybe we need a different config param setup for the shakespeare people. For now, if you're running shakespeare, just revert my 2 line commit (which is the latest commit).
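To spell out why this blows up the Shakespeare runtime: per-iteration wall time scales roughly linearly with gradient_accumulation_steps, because each iteration runs that many forward/backward micro-steps before a single optimizer step. A toy sketch of the pattern is below; it is not the actual train.py loop, and the model, batch function, and the value 8 are stand-ins based on the discussion above.

    import torch

    # Toy stand-ins so the sketch runs on its own; in nanoGPT these are the GPT
    # model, get_batch(), and the AdamW optimizer built in train.py.
    model = torch.nn.Linear(16, 16)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    def get_batch(split):
        x = torch.randn(64, 16)
        return x, x

    max_iters = 10
    gradient_accumulation_steps = 8  # vs. 1: roughly 8x more work per "iteration"

    for it in range(max_iters):
        optimizer.zero_grad(set_to_none=True)
        # Each iteration does this many forward/backward micro-steps, so the
        # time per iteration grows roughly linearly with this value.
        for micro_step in range(gradient_accumulation_steps):
            X, Y = get_batch('train')
            loss = torch.nn.functional.mse_loss(model(X), Y)
            (loss / gradient_accumulation_steps).backward()  # average over micro-steps
        optimizer.step()  # one weight update per iteration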
@otaviogood it makes a lot of sense now! After reverting your commit, I am able to run the full training in 2.5 minutes on a 1x 4090. Like you said, it is more about tuning hyperparameters for different datasets, and there is no single common solution. Thanks for looking into this. It helped me understand the overall picture better.
I am curious: since your commit was specifically designed for the OWT dataset, could you post the hyperparameters for a 1x 4090 config? Right now I am seeing it take 35 days for 600k steps. After how many steps were you able to reach loss ~2.9?
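(That estimate works out to roughly 35 x 24 x 3600 / 600,000 ≈ 5 seconds per step, if my arithmetic is right.)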
I originally asked this question here: https://github.com/karpathy/nanoGPT/issues/179 :)
So my CL basically screws over all the people training Shakespeare. :/ Sorry about that. [...] So maybe we need a different config param setup for the shakespeare people. For now, if you're running shakespeare, just revert my 2 line commit (which is the latest commit).
@otaviogood This makes sense. Would you want to make a PR fixing it, perhaps? I imagine that config/train_shakespeare_char.py should set gradient_accumulation_steps = 1, and that train.py should have one more parameter, e.g. gradient_accumulation_steps_no_ddp_multiplier = 8, which would be set to 1 in train_shakespeare_char.py. Right?
This makes an enormous difference in training time of Shakespeare on a single GPU.
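Concretely, I imagine the wiring going something like the sketch below. This is only my reading of the suggestion, not an actual patch; the multiplier name is the one proposed above, and the ddp flag and default values are placeholders for whatever train.py actually uses.

    # train.py (sketch): defaults, overridable by config files / command line
    gradient_accumulation_steps = 1                      # placeholder default for the sketch
    gradient_accumulation_steps_no_ddp_multiplier = 8    # new parameter suggested above

    # config/train_shakespeare_char.py would then set:
    #   gradient_accumulation_steps = 1
    #   gradient_accumulation_steps_no_ddp_multiplier = 1

    # later in train.py (assumed logic), after the config overrides are applied
    ddp = False  # placeholder; the real script derives this from the RANK env var
    if not ddp:
        # On a single GPU, scale up accumulation only when the config asks for it;
        # shakespeare_char keeps the multiplier at 1, so its iterations stay fast.
        gradient_accumulation_steps *= gradient_accumulation_steps_no_ddp_multiplier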
Ok, I finally made a PR to fix this. I tried to test it more thoroughly this time, but I still left out testing on clusters. If anyone else wants to double-check that my PR works, that would be helpful. I put lots of details of my training runs in the comments.