nanoGPT
Reducing GPU memory usage & a bug I found
I noticed that the GPU memory usage in "resume" mode is higher than in "scratch" mode, so I checked the code and found this line:
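The exact line isn't quoted above, but the resume path in nanoGPT's train.py loads the checkpoint roughly like this (a sketch from memory, details vary by version; out_dir, device and model are defined earlier in train.py):

import os
import torch

ckpt_path = os.path.join(out_dir, 'ckpt.pt')
checkpoint = torch.load(ckpt_path, map_location=device)  # checkpoint tensors land directly on the GPU
state_dict = checkpoint['model']
model.load_state_dict(state_dict)                        # weights are copied into the model's parameters

After load_state_dict, the names checkpoint and state_dict still reference a second full copy of the weights on the GPU until they are dropped, which is the extra memory compared to a "scratch" run.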
I wasn't sure whether that memory could be freed, so I asked ChatGPT:
ChatGPT suggested a way to solve the problem, so I modified train.py:
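The edited code isn't shown here; judging from the follow-up below, the first version of the fix was presumably just releasing the allocator cache after the weights are loaded, something like:

torch.cuda.empty_cache()  # hand cached, unused blocks back to the CUDA driver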
I hope the developer sees my comment and fixes the bug in the nanoGPT program.
Then I discovered that it can't free up all the memory, so I modified it to:

import gc

del state_dict
del checkpoint
gc.collect()
That ChatGPT response has basically ripped its wording from https://saturncloud.io/blog/how-to-clear-gpu-memory-after-pytorch-model-training-without-restarting-kernel/, but you should make this a PR. Also, the del keyword doesn't need gc; it should clear the memory immediately.
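For what it's worth, this is easy to check in isolation: del drops the Python reference (and CPython frees refcounted objects immediately), but PyTorch keeps the freed block in its caching allocator, so nvidia-smi only shows the drop after torch.cuda.empty_cache(). A small standalone sketch, not part of train.py:

import torch

def report(tag):
    alloc = torch.cuda.memory_allocated() / 2**20     # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 2**20   # memory held by the caching allocator
    print(f"{tag}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

x = torch.randn(256, 1024, 1024, device='cuda')  # roughly 1 GiB of float32
report("after alloc")
del x                      # reference dropped, memory returns to the allocator cache
report("after del")
torch.cuda.empty_cache()   # cached blocks are handed back to the driver
report("after empty_cache")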
@cooper-him I would also wrap the del statements in a try-except, as checkpoint may not be defined.
In any case, I tested the code (training a model with the same parameters as the smallest GPT-2, i.e., 12 layers, 12 heads per layer, 768 embedding dimensions, and the gpt2 tokenizer), and while I do see an improvement from adding the garbage-collector call, I still get about 300 MB of higher memory usage when resuming training from the checkpoint.
For context, I'm training on OpenWebText with a batch size of 1 and 10 gradient accumulation steps (I was just playing around with it, since my own GPU only has 6 GB of VRAM 😂).
Here's what I added:
import gc
import torch

try:
    del state_dict
    del checkpoint
except NameError:
    # state_dict/checkpoint only exist when resuming from a checkpoint
    pass
finally:
    state_dict = None
    checkpoint = None
    torch.cuda.empty_cache()  # return the cached checkpoint memory to the driver
    gc.collect()
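If it helps, one way to pin down where the remaining ~300 MB goes would be to log peak GPU memory at the same point in a "scratch" run and a "resume" run, e.g. (a sketch of instrumentation, not something that's in nanoGPT itself):

import torch

torch.cuda.reset_peak_memory_stats()
# ... model init / checkpoint loading happens here ...
peak = torch.cuda.max_memory_allocated() / 2**20
print(f"peak GPU memory after init ({init_from}): {peak:.0f} MiB")  # init_from is nanoGPT's config value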