
CUDA out of memory

Open yangxuan14nlp opened this issue 1 year ago • 15 comments

I am using a Tesla T4 with 16 GB of VRAM and want to fine-tune the 7B model. Every time it reaches iteration 200 of the first epoch, it reports a GPU out-of-memory error; it looks like there is not enough memory when the model is saved during validation. However, according to https://zhuanlan.zhihu.com/p/616504594, fine-tuning works on a 12 GB RTX 4070, so what could be the reason? What I have tried so far: 1. --micro_batch_size 1, which did not help.

Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: ./trans_chinese_alpaca_data.json
output_dir: ./lora-alpaca-zh
batch_size: 128
micro_batch_size: 2
num_epochs: 2
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 33/33 [00:18<00:00, 1.79it/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. The class this function is called from is 'LlamaTokenizer'.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 428.82it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-baf974d16126c7f1.arrow and /root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-6013f18c705337f9.arrow
{'loss': 2.2953, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.03}
{'loss': 2.208, 'learning_rate': 5.9999999999999995e-05, 'epoch': 0.05}
{'loss': 2.0048, 'learning_rate': 8.999999999999999e-05, 'epoch': 0.08}
{'loss': 1.6192, 'learning_rate': 0.00011999999999999999, 'epoch': 0.1}
{'loss': 1.381, 'learning_rate': 0.00015, 'epoch': 0.13}
{'loss': 1.2977, 'learning_rate': 0.00017999999999999998, 'epoch': 0.15}
{'loss': 1.2597, 'learning_rate': 0.00020999999999999998, 'epoch': 0.18}
{'loss': 1.2318, 'learning_rate': 0.00023999999999999998, 'epoch': 0.21}
{'loss': 1.2307, 'learning_rate': 0.00027, 'epoch': 0.23}
{'loss': 1.2053, 'learning_rate': 0.0003, 'epoch': 0.26}
{'loss': 1.1919, 'learning_rate': 0.0002955621301775148, 'epoch': 0.28}
{'loss': 1.1657, 'learning_rate': 0.00029112426035502955, 'epoch': 0.31}
{'loss': 1.1413, 'learning_rate': 0.00028668639053254437, 'epoch': 0.33}
{'loss': 1.1372, 'learning_rate': 0.00028224852071005914, 'epoch': 0.36}
{'loss': 1.1229, 'learning_rate': 0.00027781065088757395, 'epoch': 0.39}
{'loss': 1.1173, 'learning_rate': 0.0002733727810650887, 'epoch': 0.41}
{'loss': 1.1279, 'learning_rate': 0.00026893491124260353, 'epoch': 0.44}
{'loss': 1.1182, 'learning_rate': 0.0002644970414201183, 'epoch': 0.46}
{'loss': 1.112, 'learning_rate': 0.0002600591715976331, 'epoch': 0.49}
{'loss': 1.0954, 'learning_rate': 0.00025562130177514793, 'epoch': 0.52}
{'eval_loss': 1.1259599924087524, 'eval_runtime': 328.7811, 'eval_samples_per_second': 6.083, 'eval_steps_per_second': 0.76, 'epoch': 0.52}
26%|███████████████████████████████▏ | 200/776 [6:33:46<18:07:50, 113.32s/it]
Traceback (most recent call last):
  File "/new_data/yangxuan/alpaca-lora/finetune.py", line 276, in <module>
    fire.Fire(train)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/new_data/yangxuan/alpaca-lora/finetune.py", line 266, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2006, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2291, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2348, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2830, in save_model
    self._save(output_dir)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2873, in _save
    state_dict = self.model.state_dict()
  File "/new_data/yangxuan/alpaca-lora/finetune.py", line 259, in <lambda>
    self, old_state_dict()
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
    outputs = torch.empty_like(tensor)  # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 14.58 GiB total capacity; 13.37 GiB already allocated; 14.56 MiB free; 13.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
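The allocator hint at the end of that message can be tried directly. A minimal sketch of setting PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation follows; the 128 MiB split size is an arbitrary example value, and this only works around fragmentation, not the extra memory that undo_layout needs:

```python
import os

# The allocator reads this variable when it is first initialised, so it has
# to be set before the first CUDA allocation; setting it before importing
# torch is the safe option. 128 MiB is an arbitrary example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

print(torch.cuda.get_device_name(0))
print(f"{torch.cuda.get_device_properties(0).total_memory / 2**30:.2f} GiB total")
```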

Thanks in advance!

yangxuan14nlp avatar Apr 17 '23 07:04 yangxuan14nlp

Same error: https://github.com/tloen/alpaca-lora/issues/344

It errors out at 200 iterations.

@tloen

lksysML avatar Apr 17 '23 07:04 lksysML

This seems to be related to saving the model. My memory usage is around 16 GB, but when the trainer tries to save the model, or when model.save_pretrained is called, the OOM occurs. So for some reason the line 'self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)' tries to allocate more than an additional 8 GB of memory.
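For context, the state_dict() call in the traceback comes from an override in finetune.py that keeps only the LoRA weights when saving. Reconstructed roughly from the traceback (a sketch, not the verbatim source), it looks like the following; the OOM happens inside old_state_dict(), while bitsandbytes undoes the 8-bit tile layout of every frozen linear layer:

```python
from peft import get_peft_model_state_dict


def patch_state_dict_to_lora_only(model):
    """Sketch of the override in finetune.py, reconstructed from the traceback.

    old_state_dict() still walks every bitsandbytes Linear8bitLt module, and
    its _save_to_state_dict() calls undo_layout(), whose torch.empty_like()
    allocation is what fails in the traceback above. Only afterwards is the
    result filtered down to the LoRA parameters.
    """
    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
    ).__get__(model, type(model))
    return model
```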

KukumavMozolo avatar Apr 17 '23 07:04 KukumavMozolo

I was able to fix this issue by rolling back accelerate, peft, bitsandbytes, and transformers to commits dated around 5-6 April, when my previous finetunes were successful. I didn't change any parameters and everything worked.

It's definitely an issue with one of these dependencies; we need to pinpoint which one. The issue is not in PyTorch.

lksysML avatar Apr 17 '23 09:04 lksysML

I checked, and bitsandbytes got bumped to 0.38.0 a few days ago; using bitsandbytes==0.37.2 fixes it for me.
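A quick way to confirm which version is actually installed in the environment (a standard-library sketch, nothing specific to this repo):

```python
import importlib.metadata

# Expect "0.37.2" after the downgrade; 0.38.0 is the version reported above
# to trigger the OOM while saving checkpoints.
print(importlib.metadata.version("bitsandbytes"))
```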

KukumavMozolo avatar Apr 17 '23 09:04 KukumavMozolo

I checked, and bitsandbytes got bumped to 0.38.0 a few days ago; using bitsandbytes==0.37.2 fixes it for me.

Super!

lksysML avatar Apr 17 '23 09:04 lksysML

Thanks, it is useful.


yangxuan14nlp avatar Apr 18 '23 09:04 yangxuan14nlp

Why am I getting CUDA out of memory running llama-7b on a 3090 with 24 GB? I also tried two 3090s and got the same error. It is reported as soon as the model is loaded at model = LlamaForCausalLM.from_pretrained(: RuntimeError: CUDA error: out of memory. These are my parameter settings:
Training Alpaca-LoRA model with params:
base_model: ../LLaMA-7B
data_path: ./instruction_data.json
output_dir: ./lora-alpaca
batch_size: 24
micro_batch_size: 1
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 400
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
add_eos_token: False
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca_short
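For reference, a minimal sketch of loading the 7B base model in 8-bit with an automatic device map, which is roughly what finetune.py does. The base_model path is taken from the report above, the exact keyword arguments are an assumption about the local setup, and the checkpoint must already be converted to the Hugging Face format for from_pretrained to work at all:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model = "../LLaMA-7B"  # path from the report above; must be HF-format weights

# 8-bit weights keep the 7B base at roughly 7-8 GB of VRAM; fp16 needs about
# 14 GB and fp32 about 27 GB, which will not fit even on a 24 GB 3090.
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,          # requires bitsandbytes + accelerate
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = LlamaTokenizer.from_pretrained(base_model)
```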

Stark-zheng avatar Apr 26 '23 01:04 Stark-zheng

I tried peft==0.2.0 and bitsandbytes==0.37.2, but it still runs out of memory at the second validation pass. 7B model on 24 GB of VRAM.
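One way to check whether the spike really comes from the evaluation/checkpoint step rather than from training is to log allocator stats around it. A small diagnostic sketch using plain torch.cuda counters (a debugging aid, not a fix):

```python
import gc
import torch


def log_vram(tag: str) -> None:
    # Allocated vs. reserved VRAM, in GiB, so the spike can be tied to a step.
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[{tag}] allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")


# Example: call log_vram("before eval") / log_vram("after eval") around the
# evaluation step, and release cached blocks between the two phases:
gc.collect()
torch.cuda.empty_cache()
```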

luxuriance19 avatar Apr 26 '23 11:04 luxuriance19

Why am I getting CUDA out of memory running llama-7b on a 3090 with 24 GB? I also tried two 3090s and got the same error. It is reported as soon as the model is loaded at model = LlamaForCausalLM.from_pretrained(: RuntimeError: CUDA error: out of memory. These are my parameter settings:
Training Alpaca-LoRA model with params:
base_model: ../LLaMA-7B
data_path: ./instruction_data.json
output_dir: ./lora-alpaca
batch_size: 24
micro_batch_size: 1
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 400
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
add_eos_token: False
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca_short

I have the same problem as you.

zh25714 avatar Apr 29 '23 12:04 zh25714

Happening for me right now on the latest transformers and bnb 0.37.2.

teknium1 avatar Apr 30 '23 10:04 teknium1

Same issue. Tried reverting versions to no avail. Currently on 64 GB of VRAM.

freelerobot avatar May 04 '23 02:05 freelerobot

Has anybody solved this problem?

luxuriance19 avatar May 06 '23 02:05 luxuriance19

Can anyone try peft 0.2.0, like the change @cnbeining made in his repo that references this issue?

teknium1 avatar May 06 '23 06:05 teknium1

Using bitsandbytes==0.37.2:

If you get 'undefined symbol: cget_col_row_stats' when doing this step, try the following:

cp libbitsandbytes_cuda117.so libbitsandbytes_cpu.so
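The .so files that the cp command refers to live inside the installed bitsandbytes package; a small sketch to locate that directory before copying (the libbitsandbytes_cuda117.so name comes from the comment above and depends on the local CUDA version):

```python
import os
import bitsandbytes

# The libbitsandbytes_*.so files sit next to the package's __init__.py;
# the cp command above is run inside this directory.
pkg_dir = os.path.dirname(bitsandbytes.__file__)
print(pkg_dir)
print([f for f in os.listdir(pkg_dir) if f.startswith("libbitsandbytes")])
```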

jasonvanf avatar May 06 '23 07:05 jasonvanf

I checked, and bitsandbytes got bumped to 0.38.0 a few days ago; using bitsandbytes==0.37.2 fixes it for me.

Super!

Worked for me!

afnanhabib787 avatar Jun 05 '23 09:06 afnanhabib787