nanoGPT

Reduce GPU memory usage (bug found in "resume" mode)

Open cooper-him opened this issue 1 year ago • 3 comments

I noticed that GPU memory usage in "resume" mode is higher than in "scratch" mode, so I checked the code and found this line: [screenshot: old train.py]. I wasn't sure whether it actually frees up memory, so I asked ChatGPT: [screenshot: ChatGPT's reply]. ChatGPT suggested a way to solve the problem, so I modified train.py: [screenshot: Desktop Screenshot 2024-02-18 00:40:33]. I hope the developers see my comment and fix the bug in the nanoGPT program.
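For readers without the screenshots: as far as I can tell, the resume path loads the checkpoint onto the GPU and then sets checkpoint = None to "free up memory", but the separate state_dict variable still references the loaded tensors, so they stay allocated. A minimal, self-contained sketch of the effect (the dict below just stands in for what torch.load(ckpt_path, map_location=device) returns; names and sizes are illustrative):

import torch

device = 'cuda'

# Stand-in for the checkpoint dict that torch.load(..., map_location=device) returns
checkpoint = {'model': {'w': torch.randn(4096, 4096, device=device)}}  # ~64 MB on the GPU
state_dict = checkpoint['model']        # same tensors, second reference

checkpoint = None                       # what the script does to "free up memory"
print(torch.cuda.memory_allocated())    # still ~64 MB: state_dict keeps the tensors alive

state_dict = None                       # drop the last reference too
print(torch.cuda.memory_allocated())    # now the allocation is released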

cooper-him avatar Feb 17 '24 16:02 cooper-him

Then I discovered that it couldn't free up all the memory, so I modified it to:

import gc

del state_dict
del checkpoint
gc.collect()

cooper-him avatar Feb 18 '24 03:02 cooper-him

That ChatGPT response has legit ripped the words from https://saturncloud.io/blog/how-to-clear-gpu-memory-after-pytorch-model-training-without-restarting-kernel/, but you should make this a PR. Also, the del keyword doesn't need gc; it should clear the memory immediately.
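A quick way to check this on any CUDA machine (just a sketch with a throwaway tensor):

import gc
import torch

x = torch.randn(4096, 4096, device='cuda')   # ~64 MB
print(torch.cuda.memory_allocated())          # ~64 MB in use

del x                                         # reference count hits zero here
print(torch.cuda.memory_allocated())          # back to ~0, no gc.collect() needed

# gc.collect() only matters if the tensors are caught in reference cycles;
# torch.cuda.empty_cache() returns the cached blocks to the driver so the
# drop also shows up in nvidia-smi.
gc.collect()
torch.cuda.empty_cache()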

VatsaDev avatar Feb 19 '24 00:02 VatsaDev

@cooper-him I would also wrap the del statements in a try-except, as checkpoint may not be defined.

In any case, I tested the code (training a model with the same parameters as the smallest GPT-2, i.e., 12 layers, 12 heads per layer, 768 embedding dimensions, and the gpt2 tokenizer), and while I do see an improvement after adding the code that calls the garbage collector, I still get about 300 MB higher memory usage when resuming training from the checkpoint.

For context, I'm training on openwebtext with a batch size of 1 and 10 gradient accumulation steps (I was just playing with it since my own GPU only has 6 GB of VRAM 😂)
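That setup corresponds roughly to a nanoGPT config like the following (a sketch; the file name is made up, and the parameter names are the usual knobs that train.py exposes, with anything not mentioned above left at its default):

# config/resume_gpt2_small.py (hypothetical file name)
init_from = 'resume'                 # resume from out_dir/ckpt.pt
out_dir = 'out'
dataset = 'openwebtext'
batch_size = 1
gradient_accumulation_steps = 10
block_size = 1024
# GPT-2-small-sized model
n_layer = 12
n_head = 12
n_embd = 768

# run with: python train.py config/resume_gpt2_small.py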

Here's what I added:

import gc
import torch

try:
    # drop the references to the tensors loaded from the checkpoint
    del state_dict
    del checkpoint
except NameError:
    # not resuming: the variables were never defined
    pass
finally:
    state_dict = None
    checkpoint = None
    torch.cuda.empty_cache()  # return cached blocks to the driver
    gc.collect()
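If I'm reading the resume branch right, this block would go right after optimizer.load_state_dict(checkpoint['optimizer']) in train.py, replacing the existing checkpoint = None line.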

davmacario avatar Mar 14 '24 00:03 davmacario