starcoder
torch.cuda.OutOfMemoryError: CUDA out of memory When Trying to Save the Model
Howdy!
I am using the finetune/finetune.py script. It trains fine on an NVIDIA A40, but at the end, when it tries to save the model/checkpoints, it raises a torch.cuda.OutOfMemoryError: CUDA out of memory error.
Here is a full traceback:
Traceback (most recent call last):
File "/scratch/user/seyyedaliayati/auto-test-gpt/finetune.py", line 336, in <module>
main(args)
File "/scratch/user/seyyedaliayati/auto-test-gpt/finetune.py", line 325, in main
run_training(args, train_dataset, eval_dataset)
File "/scratch/user/seyyedaliayati/auto-test-gpt/finetune.py", line 313, in run_training
trainer.train()
File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/transformers/trainer.py", line 2019, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/transformers/trainer.py", line 2308, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/transformers/trainer.py", line 2365, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/transformers/trainer.py", line 2866, in save_model
self._save(output_dir)
File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/transformers/trainer.py", line 2909, in _save
state_dict = self.model.state_dict()
File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1445, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
outputs = torch.empty_like(tensor) # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 47.38 GiB total capacity; 44.56 GiB already allocated; 109.19 MiB free; 46.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Any ideas what's happening and how to solve this issue?
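Not a fix for the underlying bitsandbytes re-layout, but the error message's own suggestion is cheap to try: cap the allocator's split size to reduce fragmentation. A minimal sketch (the 128 MiB value is an illustrative assumption to tune, not a known-good setting):

```python
import os

# The CUDA caching allocator reads this variable once, when torch first
# initializes CUDA, so set it before `import torch` (or export it in the
# shell before launching finetune.py).
# 128 is an illustrative value; tune it for your workload.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

Equivalently, `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128` before running the script.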
https://github.com/MHketbi/starcoder
try my fork
same error here with A40.
Worked on NVIDIA A100 80 GB, but not on NVIDIA A40 48 GB
model.gradient_checkpointing_enable()
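For context on the one-liner above: in a Hugging Face Trainer setup you call `model.gradient_checkpointing_enable()` before training (or pass `gradient_checkpointing=True` in `TrainingArguments`). The underlying mechanism, recomputing activations during backward instead of storing them, can be sketched with plain `torch.utils.checkpoint` (a CPU-runnable toy, not the starcoder model):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)


class CheckpointedStack(nn.Module):
    """Trade compute for memory: each block's activations are recomputed
    in the backward pass instead of being kept alive after forward."""

    def __init__(self, n_blocks: int = 4, dim: int = 32):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n_blocks))

    def forward(self, x):
        for blk in self.blocks:
            # use_reentrant=False is the recommended mode in recent PyTorch
            x = checkpoint(blk, x, use_reentrant=False)
        return x


model = CheckpointedStack()
x = torch.randn(8, 32, requires_grad=True)
loss = model(x).sum()
loss.backward()  # gradients flow through the recomputed activations
```

Note that checkpointing saves activation memory during training; it does not by itself change the memory needed by the bitsandbytes `undo_layout` step at save time.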
Got this to run on NVIDIA A100-SXM4-40GB thanks to @MHketbi
After changing device_map={"": Accelerator().process_index} to device_map='auto', the checkpoints saved without any issues. Accelerator().process_index was returning 0, which I guess was pinning everything to GPU 0 and not letting Accelerate do its magic...
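For reference, the change described above amounts to something like the following at model-load time. This is a hedged sketch, not a verified drop-in for finetune.py; it assumes the script's 8-bit loading path and uses bigcode/starcoder (the model this thread is about) as the default name:

```python
def load_model_8bit(model_name: str = "bigcode/starcoder"):
    """Sketch of the device_map change discussed above.

    Requires transformers, accelerate, and bitsandbytes to be installed;
    the import is kept inside the function so the sketch itself has no
    hard dependency at definition time.
    """
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_name,
        load_in_8bit=True,
        # was: device_map={"": Accelerator().process_index}
        device_map="auto",  # let Accelerate place/offload layers itself
    )
```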
Regarding model.gradient_checkpointing_enable: does this help you personally? If yes, what GPU were you able to use, and what were you trying to do, fine-tuning or training from scratch?
Hi @esko22, I tried making the following change: device_map='auto'. However, I am still getting the same error. I am using an NVIDIA A100-SXM4-40GB. Are you running the fine-tuning on multiple GPUs?
Traceback (most recent call last):
File "finetune/finetune.py", line 408, in <module>
main(args)
File "finetune/finetune.py", line 401, in main
run_training(args, train_dataset, eval_dataset)
File "finetune/finetune.py", line 391, in run_training
trainer.train()
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 1883, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 2195, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 2252, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 2765, in save_model
self._save(output_dir)
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 2823, in _save
self.model.save_pretrained(
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/peft/peft_model.py", line 135, in save_pretrained
output_state_dict = get_peft_model_state_dict(
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/peft/utils/save_and_load.py", line 32, in get_peft_model_state_dict
state_dict = model.state_dict()
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 336, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 100, in undo_layout
return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 39.59 GiB total capacity; 36.59 GiB already allocated; 88.19 MiB free; 38.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
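This second traceback dies inside get_peft_model_state_dict calling the full model.state_dict(), which forces bitsandbytes to undo its int8 layout on the GPU. One workaround people use in this situation is to collect only the small LoRA adapter tensors via named_parameters(), which never touches the 8-bit base weights. A hedged sketch of that idea, not the PEFT API itself (the "lora_" name filter assumes PEFT's LoRA parameter naming):

```python
def lora_adapter_state_dict(model):
    """Pull only LoRA adapter weights onto the CPU.

    Avoids calling model.state_dict(), which in this setup triggers
    bitsandbytes' undo_layout on the GPU and OOMs. Iterating over
    named_parameters() leaves the 8-bit base weights untouched.
    """
    return {
        name: param.detach().cpu()
        for name, param in model.named_parameters()
        if "lora_" in name
    }
```

The result could then be saved with something like `torch.save(lora_adapter_state_dict(model), "adapter.bin")`, at a fraction of the full model's size.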
No - single GPU A100 on Colab
@esko22 - Thank you for your reply. Were you using a token size of 1024?