Hi, I'm hitting an out-of-memory issue fine-tuning RedPajama-INCITE-7B-Base on the Alpaca data, running on a single-GPU g5.16xlarge instance with 24 GiB of GPU memory. In adapter_v2.py I changed learning_rate = 3e-3 and micro_batch_size = 1. Fine-tuning runs well at first, but crashes with an out-of-memory error after 65498 iterations. Does anyone know how to solve this? Thanks!
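For context, these are the only changes I made to the hyperparameter constants near the top of finetune/adapter_v2.py (just a sketch of my local edit; everything else is left at its default):

```python
# finetune/adapter_v2.py -- my local hyperparameter overrides (sketch)
learning_rate = 3e-3    # learning rate I used for this run
micro_batch_size = 1    # smallest micro-batch to keep per-step memory low
```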
iter 65496: loss 1.2029, time: 101.54ms
iter 65497: loss 1.5817, time: 184.24ms
iter 65498: loss 1.4716, time: 101.98ms
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 281, in
CLI(setup)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
return _run_component(component, cfg_init)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
return component(**cfg)
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 71, in setup
fabric.launch(main, data_dir, checkpoint_dir, out_dir)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 732,in launch
return self._wrap_and_launch(function, self, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 814,in _wrap_and_launch
return to_run(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 823,in _wrap_with_setup
return to_run(*args, **kwargs)
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 105, in main
train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir)
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 148, in train
fabric.backward(loss / gradient_accumulation_iters)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 387,in backward
self._strategy.backward(tensor, module, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/strategies/strategy.py", line 179, in backward
self.precision.backward(tensor, module, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/plugins/precision/precision.py", line 89, in backward
tensor.backward(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/_tensor.py", line 491, in backward
torch.autograd.backward(
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/autograd/init.py", line 204,in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 200.00 MiB. GPU 0 has a total capacty of 22.19 GiB of which 106.50 MiB is free. Including non-PyTorch memory, this process has 22.08 GiB memory in use. Of the allocated memory 20.42 GiB is allocated by PyTorch, and 1.36 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
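I saw the hint about max_split_size_mb in the error text but haven't tried it yet. My understanding (not verified to fix this) is that it would be set via the PYTORCH_CUDA_ALLOC_CONF environment variable before the script starts, e.g.:

```python
import os

# Must be set before the CUDA context is created, so it goes at the very top
# of the launch script (or is exported in the shell). 128 MiB is just an example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the environment variable is set
```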
Do you know if memory was slowly increasing with the iteration count, or was there just one spike that pushed you over the limit?
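If you're not sure, you could log the allocator stats every few hundred iterations from the training loop with a small helper along these lines (just a sketch; log_cuda_memory is a made-up name, and torch.cuda.max_memory_allocated reports the peak since the process started):

```python
import torch

def log_cuda_memory(fabric, iter_num: int, every: int = 500) -> None:
    """Print currently allocated and peak GPU memory every `every` iterations."""
    if iter_num % every == 0:
        allocated_gib = torch.cuda.memory_allocated() / 1024**3
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        fabric.print(f"iter {iter_num}: allocated {allocated_gib:.2f} GiB, peak {peak_gib:.2f} GiB")
```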
I just merged #143, which should reduce the memory usage of the fine-tuning script. You'll need to run scripts/prepare_alpaca.py again.
I just merged some improvements to reduce the peak memory usage. Please pull the latest changes.
I'll also be adding a guide for dealing with OOMs in #182. Hope this helps!