
What MFU score is to be expected?

Open yohan-pg opened this issue 1 year ago • 6 comments

Hello,

The training script outputs the model flops utilization (MFU), which is quite low on my card (around 7-8%). Does anyone know what score is to be expected? I don't have an A100 readily available to test this.

Thanks!

yohan-pg avatar Feb 28 '23 14:02 yohan-pg

Hi @yohan-pg, on a single RTX 4090 GPU I am getting 29% MFU while training the GPT-2 124M model. What is your GPU setup?

akjindal53244 avatar Mar 02 '23 23:03 akjindal53244

Thank you @akjindal53244 :) I am using a Quadro RTX 8000, but I am not running PyTorch 2.0 (the card isn't mine and its CUDA version is too old). I will update this if I can get the CUDA drivers updated.

yohan-pg avatar Mar 03 '23 22:03 yohan-pg

From my finetuning experience, the MFU for this model has always been around 1-5%.

VatsaDev avatar Aug 23 '23 21:08 VatsaDev

So, depending on the GPU you have, you should edit the line defining flops_promised in estimate_mfu.

nanoGPT was trained on A100 GPUs, so the code compares the achieved training speed against the maximum possible speed of an A100. For other GPUs, this line needs to be edited to get a meaningful comparison.
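For reference, the relevant part of estimate_mfu in model.py looks roughly like this (a paraphrased sketch pieced together from the code quoted later in this thread; check your own checkout for the exact lines):

def estimate_mfu(self, fwdbwd_per_iter, dt):
    # estimate model flops utilization (MFU) as a ratio of the GPU's peak FLOPS
    N = self.get_num_params()
    cfg = self.config
    L, H, Q, T = cfg.n_layer, cfg.n_head, cfg.n_embd // cfg.n_head, cfg.block_size
    flops_per_token = 6*N + 12*L*H*Q*T
    flops_per_iter = flops_per_token * T * fwdbwd_per_iter
    flops_achieved = flops_per_iter / dt  # FLOPS sustained per second of training
    flops_promised = 312e12  # A100 bfloat16 peak; this is the line to change for other GPUs
    return flops_achieved / flops_promised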

shehper avatar Sep 25 '23 13:09 shehper

Just some information to add to the above discussion. For the RTX 4090 I found that the promised peak is 82.58 TFLOPS, and I changed flops_promised accordingly in model.py.

That resulted in an MFU of over 64%.

Higher values for the batch size improved the MFU, but a single iteration step took longer.
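Concretely, the change described above boils down to one line inside estimate_mfu in model.py (the 82.58 TFLOPS number is the one quoted in this comment; look up the appropriate peak for your own card):

flops_promised = 82.58e12  # RTX 4090 peak as quoted above, replacing the A100's 312e12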

drdsgvo avatar Feb 14 '24 13:02 drdsgvo

The README says 4 days on 8xA100s, which is enough info to estimate the MFU. Copying code out of other parts of the repo:

n_layer = 12
n_head = 12
n_embd = 768

# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 5 * 8

# this makes total number of tokens be 300B
max_iters = 600000

fwdbwd_per_iter = batch_size * gradient_accumulation_steps

N = 120e6 # approximate parameter count of the GPT-2 124M model
L, H, Q, T = n_layer, n_head, n_embd//n_head, block_size
flops_per_token = 6*N + 12*L*H*Q*T
flops_per_fwdbwd = flops_per_token * T
flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
# express our flops throughput as ratio of A100 bfloat16 peak flops
# readme says 8xA100 for 4 days so
dt = (24*60*60*4) / max_iters
flops_achieved = flops_per_iter * (1.0/dt) # per second
flops_promised = 8 * 312e12 # 8xA100 GPU bfloat16 peak flops is 312 TFLOPS
mfu = flops_achieved / flops_promised

print(f"MFU achieved: {100.*mfu:.2f}%")

Results in: MFU achieved: 28.49%
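For anyone who wants to plug in their own numbers, the same arithmetic can be wrapped into a small helper (the function name and the example inputs below are hypothetical, not from the repo):

def mfu_for(dt_per_iter_s, tokens_per_iter, peak_flops, n_params=120e6,
            n_layer=12, n_head=12, n_embd=768, block_size=1024):
    # same 6*N + 12*L*H*Q*T per-token FLOPs estimate as in the snippet above
    L, H, Q, T = n_layer, n_head, n_embd // n_head, block_size
    flops_per_token = 6 * n_params + 12 * L * H * Q * T
    flops_achieved = flops_per_token * tokens_per_iter / dt_per_iter_s
    return flops_achieved / peak_flops

# hypothetical example: ~0.5M tokens per iteration (12 batch * 1024 block * 40 grad accum),
# 13.7 s per iteration, on a GPU whose bf16 peak you take to be 100 TFLOPS
print(f"MFU: {100 * mfu_for(13.7, 12 * 1024 * 40, 100e12):.2f}%")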

gngdb avatar May 04 '24 01:05 gngdb