Hi, I'm hitting an out-of-memory issue fine-tuning RedPajama-INCITE-7B-Base on the Alpaca data, running on a single-GPU g5.16xlarge instance with 24 GiB of GPU memory. In adapter_v2.py I changed learning_rate = 3e-3 and micro_batch_size = 1. Fine-tuning runs well at first, but crashes with an out-of-memory error after 65498 iterations. Does anyone know how to solve this? Thanks!
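For context, these are the only changes I made to the hyperparameter constants near the top of finetune/adapter_v2.py (just a sketch of my local edit; everything else is left at its default):

```python
# finetune/adapter_v2.py -- my local hyperparameter overrides (sketch)
learning_rate = 3e-3    # learning rate I used for this run
micro_batch_size = 1    # smallest micro-batch to keep per-step memory low
```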
iter 65496: loss 1.2029, time: 101.54ms
iter 65497: loss 1.5817, time: 184.24ms
iter 65498: loss 1.4716, time: 101.98ms
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 281, in
CLI(setup)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
return _run_component(component, cfg_init)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
return component(**cfg)
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 71, in setup
fabric.launch(main, data_dir, checkpoint_dir, out_dir)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 732,in launch
return self._wrap_and_launch(function, self, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 814,in _wrap_and_launch
return to_run(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 823,in _wrap_with_setup
return to_run(*args, **kwargs)
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 105, in main
train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir)
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 148, in train
fabric.backward(loss / gradient_accumulation_iters)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 387,in backward
self._strategy.backward(tensor, module, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/strategies/strategy.py", line 179, in backward
self.precision.backward(tensor, module, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/plugins/precision/precision.py", line 89, in backward
tensor.backward(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/_tensor.py", line 491, in backward
torch.autograd.backward(
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/autograd/init.py", line 204,in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 200.00 MiB. GPU 0 has a total capacty of 22.19 GiB of which 106.50 MiB is free. Including non-PyTorch memory, this process has 22.08 GiB memory in use. Of the allocated memory 20.42 GiB is allocated by PyTorch, and 1.36 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
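I saw the hint about max_split_size_mb in the error text but haven't tried it yet. My understanding (not verified to fix this) is that it would be set via the PYTORCH_CUDA_ALLOC_CONF environment variable before the script starts, e.g.:

```python
import os

# Must be set before the CUDA context is created, so it goes at the very top
# of the launch script (or is exported in the shell). 128 MiB is just an example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the environment variable is set
```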
Do you know if memory was slowly increasing with the iteration count, or was there just one spike that pushed you over the limit?
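If you're not sure, you could log the allocator stats every few hundred iterations from the training loop with a small helper along these lines (just a sketch; log_cuda_memory is a made-up name, and torch.cuda.max_memory_allocated reports the peak since the process started):

```python
import torch

def log_cuda_memory(fabric, iter_num: int, every: int = 500) -> None:
    """Print currently allocated and peak GPU memory every `every` iterations."""
    if iter_num % every == 0:
        allocated_gib = torch.cuda.memory_allocated() / 1024**3
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        fabric.print(f"iter {iter_num}: allocated {allocated_gib:.2f} GiB, peak {peak_gib:.2f} GiB")
```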
I just merged #143, which should reduce the memory usage of the fine-tuning script. You'll need to run scripts/prepare_alpaca.py again.
I just merged some improvements to reduce the peak memory usage. Please pull the latest changes.
I'll also be adding a guide for dealing with OOMs in #182. Hope this helps!