nanoGPT
What MFU score is to be expected?
Hello,
The training outputs the model flops utilization (MFU), which is quite low on my card (like 7-8%). Does anyone know what score is to be expected? I don't have an A100 readily available to test this.
Thanks!
Hi @yohan-pg , on 1x 4090 GPU, I am getting 29% MFU while training GPT-2 124M model. What is your GPU setup?
Thank you @akjindal53244 :) I am using a Quadro RTX 8000, but I am not running PyTorch 2.0 (they're not my cards and the CUDA version is too old). I will update this if I can get the CUDA drivers updated.
From my finetuning experience, MFU for this model has always been 1~5%.
So, depending on the GPU you have, you should edit the line defining flops_promised in estimate_mfu.
nanoGPT was trained on A100 GPUs, so the code compares the speed of training with the maximum possible speed on A100 GPUs. For other GPUs, this line will need to be edited for a good comparison.
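For reference, here is a sketch of the relevant function (paraphrased; check estimate_mfu in model.py in your copy of the repo, as the exact code may differ slightly):

def estimate_mfu(self, fwdbwd_per_iter, dt):
    # estimate model flops utilization (MFU) as a ratio of A100 bfloat16 peak FLOPS
    # flops-per-token estimate follows the PaLM paper, Appendix B: https://arxiv.org/abs/2204.02311
    N = self.get_num_params()
    cfg = self.config
    L, H, Q, T = cfg.n_layer, cfg.n_head, cfg.n_embd // cfg.n_head, cfg.block_size
    flops_per_token = 6*N + 12*L*H*Q*T
    flops_per_iter = flops_per_token * T * fwdbwd_per_iter
    flops_achieved = flops_per_iter * (1.0/dt)  # per second
    flops_promised = 312e12  # <-- A100 bfloat16 peak is 312 TFLOPS; edit this line for your GPU
    return flops_achieved / flops_promised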
Just some information to add to the above discussion: for the RTX 4090 I found the promised figure to be 82.58 TFLOPS, which I changed accordingly in model.py.
That resulted in an MFU of over 64%.
Higher batch sizes improved MFU, but each individual iteration step took longer.
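To make that concrete, the edit is just the one line (82.58 TFLOPS being, as I understand it, the RTX 4090's dense fp16/bf16 tensor-core peak, i.e. without the 2:4 sparsity figure):

flops_promised = 82.58e12  # RTX 4090 peak, replacing the A100's 312e12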
The README says 4 days on 8xA100s, which is enough info to estimate the MFU. Copying code out of other parts of the repo:
n_layer = 12
n_head = 12
n_embd = 768
# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 5 * 8
# this makes total number of tokens be 300B
max_iters = 600000
fwdbwd_per_iter = batch_size * gradient_accumulation_steps
N = 120e6 # approx. parameter count of the GPT-2 124M model, rounded
L, H, Q, T = n_layer, n_head, n_embd//n_head, block_size
flops_per_token = 6*N + 12*L*H*Q*T
flops_per_fwdbwd = flops_per_token * T
flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
# express our flops throughput as ratio of A100 bfloat16 peak flops
# readme says 8xA100 for 4 days so
dt = (24*60*60*4) / max_iters
flops_achieved = flops_per_iter * (1.0/dt) # per second
flops_promised = 8 * 312e12 # 8xA100 GPU bfloat16 peak flops is 312 TFLOPS
mfu = flops_achieved / flops_promised
print(f"MFU achieved: {100.*mfu:.2f}%")
Results in: MFU achieved: 28.49%
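As a sanity check on the intermediate values: dt works out to 345,600 s / 600,000 iters = 0.576 s per iteration, and flops_achieved to roughly 7.1e14 FLOPS, i.e. about 711 TFLOPS sustained against the 2,496 TFLOPS promised by 8 A100s, which gives the 28.49% above.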