
MFU too low in custom GPT-2 training

Open eonurk opened this issue 3 months ago • 1 comment

Hi all,

Thanks @karpathy for this repo; the lectures are also awesome!

I am training a GPT-2 model. The loss is decreasing and everything looks fine, except that my MFU values are really low, so I guess the setup is not well optimized for the hardware I have.

[...]
iter 340: loss 5.4983, time 33785.52ms, mfu 0.96%
iter 350: loss 5.4702, time 33798.82ms, mfu 0.96%
iter 360: loss 5.0448, time 33770.99ms, mfu 0.96%
iter 370: loss 5.3350, time 33695.61ms, mfu 0.96%
iter 380: loss 5.7851, time 33648.24ms, mfu 0.96%
iter 390: loss 4.6022, time 33637.29ms, mfu 0.96%

I am training with 2x TITAN Xp, and I tweaked the config parameters and train.py a bit, based on recommendations I found while searching, to get things running:

config:

batch_size = 4
block_size = 512
gradient_accumulation_steps = 64 * 2
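
As a sanity check on these settings, the effective batch works out as below (a minimal standalone sketch, not part of train.py, mirroring how the "tokens per iteration" number is reported):

# Hypothetical sanity-check snippet (not part of train.py): effective tokens
# processed per optimizer step with the config above.
batch_size = 4
block_size = 512
gradient_accumulation_steps = 64 * 2  # the "* 2" accounts for the 2 GPUs

tokens_per_iter = batch_size * block_size * gradient_accumulation_steps
print(f"tokens per iteration: {tokens_per_iter:,}")  # 262,144, matching the log further down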

Added to train.py to make it work on the TITAN Xp:

# Suppress Dynamo/compile errors and fall back to eager execution
import os  # os is already imported at the top of train.py; included here for completeness
import torch._dynamo
torch._dynamo.config.suppress_errors = True

# Adjust Torch's CUDA memory allocator (must be set before CUDA is initialized)
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

Ran this on the cluster with:

torchrun --standalone --nproc_per_node=2 train.py config/train_gpt2.py

Summary statistics for the data:

vocab size: 44,902
train has 47,316,670 tokens
val has 2,490,352 tokens

Other statistics:

tokens per iteration will be: 262,144
found vocab_size = 44902 (inside data/pathway/meta.pkl)
Initializing a new model from scratch
number of parameters: 119.44M
num decayed parameter tensors: 50, with 119,812,608 parameters
num non-decayed parameter tensors: 25, with 19,200 parameters
using fused AdamW: True

I feel like I am wasting quite a few FLOPs and a lot of time. Does anyone have any recommendations?

Thank you!

eonurk · Mar 25 '24

I found this forum post about batch size and gradient accumulation. Indeed, after changing the settings my run time was much better with almost the same drop in loss (although each optimizer step now sees fewer samples), but MFU is still low :)

batch_size = 16
block_size = 512
gradient_accumulation_steps = 1 * 2

[...]
iter 40: loss 10.6144, time 1995.54ms, mfu 1.04%
iter 50: loss 10.5928, time 1947.97ms, mfu 1.04%
iter 60: loss 10.5358, time 1983.14ms, mfu 1.04%
iter 70: loss 10.4535, time 1945.77ms, mfu 1.04%
iter 80: loss 10.3011, time 1990.12ms, mfu 1.04%
iter 90: loss 10.1907, time 1949.02ms, mfu 1.04%
iter 100: loss 10.2009, time 2414.40ms, mfu 1.02%
iter 110: loss 9.9381, time 1969.89ms, mfu 1.02%
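
The speed-up makes sense given how much smaller the effective batch per optimizer step now is (a quick back-of-the-envelope comparison, reusing the same tokens-per-iteration formula as above; purely illustrative):

# Illustrative comparison of the two configs (not part of train.py).
old_tokens = 4 * 512 * (64 * 2)   # 262,144 tokens per optimizer step
new_tokens = 16 * 512 * (1 * 2)   #  16,384 tokens per optimizer step
print(old_tokens // new_tokens)   # 16x fewer tokens per step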

eonurk · Mar 26 '24

Apparently the MFU calculation is hard-coded against the A100's peak FLOPS, so the metric does not reflect reality in my case. Therefore, closing this issue.
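
For anyone landing here: nanoGPT's estimate_mfu in model.py divides the achieved FLOPS by the A100's bfloat16 peak of 312 TFLOPS, so on older cards the percentage will look tiny even when the run is healthy. A rough rescaling for this setup (a sketch only; the ~12 TFLOPS FP32 figure for a TITAN Xp is an approximate assumption):

# Rough rescaling of the reported MFU for a different GPU (illustrative sketch,
# not nanoGPT code; the TITAN Xp peak is an approximate assumption).
a100_bf16_peak = 312e12   # peak FLOPS assumed by nanoGPT's estimate_mfu
titan_xp_peak = 12e12     # ~12 TFLOPS FP32, approximate TITAN Xp peak

reported_mfu = 0.0096                           # the 0.96% from the log
achieved_flops = reported_mfu * a100_bf16_peak  # ~3 TFLOPS per process
rescaled_mfu = achieved_flops / titan_xp_peak
print(f"~{rescaled_mfu:.0%} of the TITAN Xp's own peak")  # roughly 25%

So relative to the card's own peak this run is closer to ~25% utilization, which is a far more reasonable number for Pascal-era hardware.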

eonurk · Apr 19 '24