
CUDA out of memory

Open yangxuan14nlp opened this issue 1 year ago • 15 comments

I am using a Tesla T4 with 16 GB of VRAM and want to fine-tune the 7B model. Every time it reaches iteration 200 of the first epoch, it reports a GPU out-of-memory error; it looks like there is not enough memory when the model is saved during validation. However, according to https://zhuanlan.zhihu.com/p/616504594, fine-tuning works on a 12 GB RTX 4070, so what could be the reason? What I have tried so far: 1. --micro_batch_size 1, which did not help.

Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: ./trans_chinese_alpaca_data.json
output_dir: ./lora-alpaca-zh
batch_size: 128
micro_batch_size: 2
num_epochs: 2
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 33/33 [00:18<00:00, 1.79it/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. The class this function is called from is 'LlamaTokenizer'.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 428.82it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-baf974d16126c7f1.arrow and /root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-6013f18c705337f9.arrow
{'loss': 2.2953, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.03}
{'loss': 2.208, 'learning_rate': 5.9999999999999995e-05, 'epoch': 0.05}
{'loss': 2.0048, 'learning_rate': 8.999999999999999e-05, 'epoch': 0.08}
{'loss': 1.6192, 'learning_rate': 0.00011999999999999999, 'epoch': 0.1}
{'loss': 1.381, 'learning_rate': 0.00015, 'epoch': 0.13}
{'loss': 1.2977, 'learning_rate': 0.00017999999999999998, 'epoch': 0.15}
{'loss': 1.2597, 'learning_rate': 0.00020999999999999998, 'epoch': 0.18}
{'loss': 1.2318, 'learning_rate': 0.00023999999999999998, 'epoch': 0.21}
{'loss': 1.2307, 'learning_rate': 0.00027, 'epoch': 0.23}
{'loss': 1.2053, 'learning_rate': 0.0003, 'epoch': 0.26}
{'loss': 1.1919, 'learning_rate': 0.0002955621301775148, 'epoch': 0.28}
{'loss': 1.1657, 'learning_rate': 0.00029112426035502955, 'epoch': 0.31}
{'loss': 1.1413, 'learning_rate': 0.00028668639053254437, 'epoch': 0.33}
{'loss': 1.1372, 'learning_rate': 0.00028224852071005914, 'epoch': 0.36}
{'loss': 1.1229, 'learning_rate': 0.00027781065088757395, 'epoch': 0.39}
{'loss': 1.1173, 'learning_rate': 0.0002733727810650887, 'epoch': 0.41}
{'loss': 1.1279, 'learning_rate': 0.00026893491124260353, 'epoch': 0.44}
{'loss': 1.1182, 'learning_rate': 0.0002644970414201183, 'epoch': 0.46}
{'loss': 1.112, 'learning_rate': 0.0002600591715976331, 'epoch': 0.49}
{'loss': 1.0954, 'learning_rate': 0.00025562130177514793, 'epoch': 0.52}
{'eval_loss': 1.1259599924087524, 'eval_runtime': 328.7811, 'eval_samples_per_second': 6.083, 'eval_steps_per_second': 0.76, 'epoch': 0.52}
26%|███████████████████████████████▏ | 200/776 [6:33:46<18:07:50, 113.32s/it]
Traceback (most recent call last):
  File "/new_data/yangxuan/alpaca-lora/finetune.py", line 276, in <module>
    fire.Fire(train)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/new_data/yangxuan/alpaca-lora/finetune.py", line 266, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2006, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2291, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2348, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2830, in save_model
    self._save(output_dir)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2873, in _save
    state_dict = self.model.state_dict()
  File "/new_data/yangxuan/alpaca-lora/finetune.py", line 259, in <lambda>
    self, old_state_dict()
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
    outputs = torch.empty_like(tensor)  # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 14.58 GiB total capacity; 13.37 GiB already allocated; 14.56 MiB free; 13.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
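The allocator hint at the end of that message can be tried directly. A minimal sketch of setting PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation follows; the 128 MiB split size is an arbitrary example value, and this only works around fragmentation, not the extra memory that undo_layout needs:

```python
import os

# The allocator reads this variable when it is first initialised, so it has
# to be set before the first CUDA allocation; setting it before importing
# torch is the safe option. 128 MiB is an arbitrary example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

print(torch.cuda.get_device_name(0))
print(f"{torch.cuda.get_device_properties(0).total_memory / 2**30:.2f} GiB total")
```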

Thanks in advance!

yangxuan14nlp avatar Apr 17 '23 07:04 yangxuan14nlp

Same error: https://github.com/tloen/alpaca-lora/issues/344

It errors out at 200 iterations.

@tloen

lksysML avatar Apr 17 '23 07:04 lksysML

This seems to be related to saving the model. My memory usage is around 16 GB, but when the trainer tries to save the model, or when model.save_pretrained is called, the OOM occurs. So for some reason the line 'self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)' tries to allocate more than an additional 8 GB of memory.
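For context, the state_dict() call in the traceback comes from an override in finetune.py that keeps only the LoRA weights when saving. Reconstructed roughly from the traceback (a sketch, not the verbatim source), it looks like the following; the OOM happens inside old_state_dict(), while bitsandbytes undoes the 8-bit tile layout of every frozen linear layer:

```python
from peft import get_peft_model_state_dict


def patch_state_dict_to_lora_only(model):
    """Sketch of the override in finetune.py, reconstructed from the traceback.

    old_state_dict() still walks every bitsandbytes Linear8bitLt module, and
    its _save_to_state_dict() calls undo_layout(), whose torch.empty_like()
    allocation is what fails in the traceback above. Only afterwards is the
    result filtered down to the LoRA parameters.
    """
    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
    ).__get__(model, type(model))
    return model
```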

KukumavMozolo avatar Apr 17 '23 07:04 KukumavMozolo

I was able to fix this issue by rolling back accelerate, peft, bitsandbytes, and transformers to commits dated around 5-6 April, when my previous finetunes were successful. I didn't change any parameters and everything worked.

It's definitely an issue with one of these dependencies; we need to pinpoint which one. The issue is not in PyTorch.

lksysML avatar Apr 17 '23 09:04 lksysML

I checked, and bitsandbytes got bumped to 0.38.0 a few days ago; using bitsandbytes==0.37.2 fixes it for me.
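A quick way to confirm which version is actually installed in the environment (a standard-library sketch, nothing specific to this repo):

```python
import importlib.metadata

# Expect "0.37.2" after the downgrade; 0.38.0 is the version reported above
# to trigger the OOM while saving checkpoints.
print(importlib.metadata.version("bitsandbytes"))
```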

KukumavMozolo avatar Apr 17 '23 09:04 KukumavMozolo

I checked, and bitsandbytes got bumped to 0.38.0 a few days ago; using bitsandbytes==0.37.2 fixes it for me.

Super!

lksysML avatar Apr 17 '23 09:04 lksysML

Thanks, it is useful.


yangxuan14nlp avatar Apr 18 '23 09:04 yangxuan14nlp

Why am I getting CUDA out of memory running llama-7b on a 3090 with 24 GB? I also tried two 3090s and got the same error. It is reported as soon as the model is loaded at model = LlamaForCausalLM.from_pretrained(: RuntimeError: CUDA error: out of memory. These are my parameter settings:
Training Alpaca-LoRA model with params:
base_model: ../LLaMA-7B
data_path: ./instruction_data.json
output_dir: ./lora-alpaca
batch_size: 24
micro_batch_size: 1
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 400
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
add_eos_token: False
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca_short
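For reference, a minimal sketch of loading the 7B base model in 8-bit with an automatic device map, which is roughly what finetune.py does. The base_model path is taken from the report above, the exact keyword arguments are an assumption about the local setup, and the checkpoint must already be converted to the Hugging Face format for from_pretrained to work at all:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model = "../LLaMA-7B"  # path from the report above; must be HF-format weights

# 8-bit weights keep the 7B base at roughly 7-8 GB of VRAM; fp16 needs about
# 14 GB and fp32 about 27 GB, which will not fit even on a 24 GB 3090.
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,          # requires bitsandbytes + accelerate
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = LlamaTokenizer.from_pretrained(base_model)
```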

Stark-zheng avatar Apr 26 '23 01:04 Stark-zheng

I tried peft==0.2.0 and bitsandbytes==0.37.2, but it still runs out of memory at the second validation pass. 7B model on 24 GB of VRAM.
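One way to check whether the spike really comes from the evaluation/checkpoint step rather than from training is to log allocator stats around it. A small diagnostic sketch using plain torch.cuda counters (a debugging aid, not a fix):

```python
import gc
import torch


def log_vram(tag: str) -> None:
    # Allocated vs. reserved VRAM, in GiB, so the spike can be tied to a step.
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[{tag}] allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")


# Example: call log_vram("before eval") / log_vram("after eval") around the
# evaluation step, and release cached blocks between the two phases:
gc.collect()
torch.cuda.empty_cache()
```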

luxuriance19 avatar Apr 26 '23 11:04 luxuriance19

Why am I getting CUDA out of memory running llama-7b on a 3090 with 24 GB? I also tried two 3090s and got the same error. It is reported as soon as the model is loaded at model = LlamaForCausalLM.from_pretrained(: RuntimeError: CUDA error: out of memory. These are my parameter settings:
Training Alpaca-LoRA model with params:
base_model: ../LLaMA-7B
data_path: ./instruction_data.json
output_dir: ./lora-alpaca
batch_size: 24
micro_batch_size: 1
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 400
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
add_eos_token: False
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca_short

I have the same problem as you.

zh25714 avatar Apr 29 '23 12:04 zh25714

Happening for me right now on the latest transformers and bnb 0.37.2.

teknium1 avatar Apr 30 '23 10:04 teknium1

Same issue. Tried reverting versions to no avail. Currently on 64 GB of VRAM.

freelerobot avatar May 04 '23 02:05 freelerobot

Has anybody solved this problem?

luxuriance19 avatar May 06 '23 02:05 luxuriance19

Can anyone try peft 0.2.0, like the change @cnbeining made in his repo that references this issue?

teknium1 avatar May 06 '23 06:05 teknium1

Using bitsandbytes==0.37.2:

If you get 'undefined symbol: cget_col_row_stats' when doing this step, try the following:

cp libbitsandbytes_cuda117.so libbitsandbytes_cpu.so
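The .so files that the cp command refers to live inside the installed bitsandbytes package; a small sketch to locate that directory before copying (the libbitsandbytes_cuda117.so name comes from the comment above and depends on the local CUDA version):

```python
import os
import bitsandbytes

# The libbitsandbytes_*.so files sit next to the package's __init__.py;
# the cp command above is run inside this directory.
pkg_dir = os.path.dirname(bitsandbytes.__file__)
print(pkg_dir)
print([f for f in os.listdir(pkg_dir) if f.startswith("libbitsandbytes")])
```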

jasonvanf avatar May 06 '23 07:05 jasonvanf

I checked, and bitsandbytes got bumped to 0.38.0 a few days ago; using bitsandbytes==0.37.2 fixes it for me.

Super!

Worked for me!

afnanhabib787 avatar Jun 05 '23 09:06 afnanhabib787