
High Loss Value When Training NanoGPT on a Single Small GPU

Open darcys22 opened this issue 1 year ago • 2 comments

Hello,

I'm working with the nanoGPT train.py script, following the "reproducing GPT-2" instructions, aiming to replicate GPT-2's training on OpenWebText. Unlike the reference setup, which used 4x A100 GPUs, I trained on a single small GPU. I started with a batch size of 12, but the GPU ran out of memory, so I reduced it to 3. The only change I made to train.py was this:

batch_size = 3 # if gradient_accumulation_steps > 1, this is the micro-batch size

When running python train.py, the training loss plateaued at 7.5, far above the expected ~2.8. To be specific, it reached that loss after 2 days and stayed there until I cancelled the run after 4 days. This leads me to a few questions:

  • Is the high loss primarily due to the limited capacity of my smaller GPU?
  • Does the reduction in batch size impact the learning efficacy of the model, beyond just slowing training down? (Rough arithmetic below.)
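
To put rough numbers on that second question: if I'm reading train.py right, the tokens consumed per optimizer step are gradient_accumulation_steps * batch_size * block_size (times the number of GPUs under DDP), so cutting batch_size from 12 to 3 without touching anything else shrinks the effective batch to a quarter. A back-of-the-envelope sketch, assuming the stock single-GPU defaults of gradient_accumulation_steps = 40 and block_size = 1024:

block_size = 1024        # assumed stock default
grad_accum_steps = 40    # assumed stock default (5 * 8), unchanged on a single GPU

def tokens_per_iter(batch_size, n_gpus=1):
    # mirrors the tokens-per-iteration figure train.py prints at startup
    return grad_accum_steps * n_gpus * batch_size * block_size

print(tokens_per_iter(12))  # 491,520 tokens/step, the intended ~0.5M
print(tokens_per_iter(3))   # 122,880 tokens/step, a quarter of that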

Any advice on training nanoGPT effectively on limited hardware would be greatly appreciated, along with suggestions for configuration adjustments; for example, is compensating with gradient_accumulation_steps, as sketched below, the right approach?
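
For what it's worth, the adjustment I was considering (purely my guess at how the config overrides compose) is to keep batch_size * gradient_accumulation_steps constant, for example via a small override file; the file name below is made up:

# hypothetical override file, e.g. config/train_gpt2_small_gpu.py,
# run as: python train.py config/train_gpt2_small_gpu.py
batch_size = 3                     # micro-batch that fits in memory
gradient_accumulation_steps = 160  # 40 * (12 / 3), keeps ~0.5M tokens per step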

Thank you for your time and insights.

darcys22 · Jan 01 '24

What GPU? Also, are you really training the 124M model? That wouldn't train on a "single small GPU".

VatsaDev · Jan 05 '24

n_layer = 12
n_head = 12
n_embd = 768

Yeah, that is absolutely going to be the reason. Appreciate the response!
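
For anyone who lands here later, a quick back-of-the-envelope check (my own arithmetic, assuming GPT-2's vocab size of 50257, a block size of 1024, and the output head tied to the token embedding) that these settings really are the ~124M-parameter GPT-2:

n_layer, n_embd, vocab_size, block_size = 12, 768, 50257, 1024

emb = vocab_size * n_embd + block_size * n_embd   # token + position embeddings
per_block = (
    4 * n_embd * n_embd + 4 * n_embd    # attention: qkv + output projection, with biases
    + 8 * n_embd * n_embd + 5 * n_embd  # MLP: two linear layers, with biases
    + 4 * n_embd                        # two layernorms (weight + bias each)
)
total = emb + n_layer * per_block + 2 * n_embd    # plus the final layernorm
print(total)  # 124,439,808, i.e. the familiar "124M"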

darcys22 · Jan 10 '24