alpaca-lora
CUDA out of memory
I'm using a Tesla T4 with 16 GB of VRAM and want to fine-tune the 7B model. Every time training reaches iteration 200 of the first epoch, it throws a GPU out-of-memory error; it looks like there isn't enough memory when the model file is exported during validation. However, according to https://zhuanlan.zhihu.com/p/616504594, fine-tuning works on a 12 GB RTX 4070, so what could the cause be? What I have tried: 1. --micro_batch_size 1, which didn't help (see the sketch at the end of this post).
Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: ./trans_chinese_alpaca_data.json
output_dir: ./lora-alpaca-zh
batch_size: 128
micro_batch_size: 2
num_epochs: 2
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:18<00:00, 1.79it/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 428.82it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-baf974d16126c7f1.arrow and /root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-6013f18c705337f9.arrow
{'loss': 2.2953, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.03}
{'loss': 2.208, 'learning_rate': 5.9999999999999995e-05, 'epoch': 0.05}
{'loss': 2.0048, 'learning_rate': 8.999999999999999e-05, 'epoch': 0.08}
{'loss': 1.6192, 'learning_rate': 0.00011999999999999999, 'epoch': 0.1}
{'loss': 1.381, 'learning_rate': 0.00015, 'epoch': 0.13}
{'loss': 1.2977, 'learning_rate': 0.00017999999999999998, 'epoch': 0.15}
{'loss': 1.2597, 'learning_rate': 0.00020999999999999998, 'epoch': 0.18}
{'loss': 1.2318, 'learning_rate': 0.00023999999999999998, 'epoch': 0.21}
{'loss': 1.2307, 'learning_rate': 0.00027, 'epoch': 0.23}
{'loss': 1.2053, 'learning_rate': 0.0003, 'epoch': 0.26}
{'loss': 1.1919, 'learning_rate': 0.0002955621301775148, 'epoch': 0.28}
{'loss': 1.1657, 'learning_rate': 0.00029112426035502955, 'epoch': 0.31}
{'loss': 1.1413, 'learning_rate': 0.00028668639053254437, 'epoch': 0.33}
{'loss': 1.1372, 'learning_rate': 0.00028224852071005914, 'epoch': 0.36}
{'loss': 1.1229, 'learning_rate': 0.00027781065088757395, 'epoch': 0.39}
{'loss': 1.1173, 'learning_rate': 0.0002733727810650887, 'epoch': 0.41}
{'loss': 1.1279, 'learning_rate': 0.00026893491124260353, 'epoch': 0.44}
{'loss': 1.1182, 'learning_rate': 0.0002644970414201183, 'epoch': 0.46}
{'loss': 1.112, 'learning_rate': 0.0002600591715976331, 'epoch': 0.49}
{'loss': 1.0954, 'learning_rate': 0.00025562130177514793, 'epoch': 0.52}
{'eval_loss': 1.1259599924087524, 'eval_runtime': 328.7811, 'eval_samples_per_second': 6.083, 'eval_steps_per_second': 0.76, 'epoch': 0.52}
26%|███████████████████████████████▏ | 200/776 [6:33:46<18:07:50, 113.32s/it]
Traceback (most recent call last):
File "/new_data/yangxuan/alpaca-lora/finetune.py", line 276, in <module>
Thanks in advance!
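(For reference, a minimal sketch of how batch_size and micro_batch_size appear to interact in finetune.py: micro_batch_size is the per-step batch and the remainder of the effective batch comes from gradient accumulation, so lowering it mainly shrinks per-step activation memory, not the footprint of the weights themselves. The numbers below are just the values from my params dump.)

```python
# Sketch (assuming finetune.py's usual scheme): the effective batch of
# `batch_size` examples is split into micro-batches, and the rest is made
# up with gradient accumulation.
batch_size = 128       # from the params dump above
micro_batch_size = 2   # I also tried 1
gradient_accumulation_steps = batch_size // micro_batch_size
print(gradient_accumulation_steps)  # 64 accumulation steps per optimizer update
```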
Same error: https://github.com/tloen/alpaca-lora/issues/344
It errors out at 200 iterations.
@tloen
This seems to be related to saving the model. My memory usage is around 16 GB, but when the trainer tries to save the model, or when model.save_pretrained is called, the OOM occurs. So for some reason this line
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
tries to allocate more than an additional 8 GB of memory.
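One way to sidestep that while staying on the newer bitsandbytes might be to persist only the LoRA adapter tensors, so that state_dict() is never called on the 8-bit base weights. This is just a minimal sketch, not the upstream fix, and the resulting file is not in PEFT's adapter format:

```python
# Sketch of a possible workaround: dump only the LoRA adapter tensors.
# named_parameters() does not go through bitsandbytes' _save_to_state_dict,
# so undo_layout() is never run on the int8 base weights.
import torch

def save_lora_only(model, path="lora_only.bin"):
    lora_state = {
        name: param.detach().cpu()
        for name, param in model.named_parameters()
        if "lora_" in name  # PEFT names LoRA tensors with a "lora_" prefix
    }
    torch.save(lora_state, path)
```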
I was able to fix this issue by rolling back accelerate, peft, bitsandbytes, and transformers to commits dated around April 5-6, when my previous finetunes were successful. I didn't change any parameters and everything worked.
It's definitely an issue with one of these dependencies; we need to pinpoint which one. The issue is not in PyTorch.
I checked, and bitsandbytes got bumped to 0.38.0 a few days ago; using bitsandbytes==0.37.2 fixes it for me.
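If it helps anyone, here is a small sanity check (just a sketch; it uses importlib.metadata so it works whether or not the package exposes __version__) to confirm the pin actually took effect before a long run:

```python
# Sketch: fail fast if an environment rebuild silently pulled bitsandbytes
# back up to 0.38.0.
from importlib.metadata import version

bnb_version = version("bitsandbytes")
assert bnb_version == "0.37.2", f"expected bitsandbytes 0.37.2, got {bnb_version}"
```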
Super!
Thanks, that's useful.
Why do I get CUDA out of memory running llama-7b on a 3090 with 24 GB? I also tried two 3090s and got the same error. It is raised as soon as the model is loaded at model = LlamaForCausalLM.from_pretrained(:
RuntimeError: CUDA error: out of memory
These are my parameter settings:
Training Alpaca-LoRA model with params:
base_model: ../LLaMA-7B
data_path: ./instruction_data.json
output_dir: ./lora-alpaca
batch_size: 24
micro_batch_size: 1
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 400
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
add_eos_token: False
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca_short
I tried peft==0.2.0 and bitsandbytes==0.37.2, but it still runs out of memory at the second validation. 7B model on 24 GB VRAM.
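If in your case the OOM really happens inside evaluation rather than at checkpoint save time, it might be worth shrinking the evaluation footprint. A hedged sketch using standard transformers.TrainingArguments fields (the field names are real; the values are only illustrative, not tuned recommendations):

```python
# Sketch: reduce evaluation memory pressure.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./lora-alpaca",
    per_device_eval_batch_size=1,  # smallest possible eval batches
    eval_accumulation_steps=4,     # move accumulated outputs to CPU every 4 eval steps
)
```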
Replying to the 3090 24 GB question above: I have the same problem as you.
Happening for me right now on the latest transformers and bnb 0.37.2.
Same issue. Tried reverting versions to no avail. Currently on 64 GB VRAM.
Has anybody solved this problem?
Can anyone try peft 0.2.0, like @cnbeining's change in his repo referencing this issue?
Using bitsandbytes==0.37.2: if you get 'undefined symbol: cget_col_row_stats' at that step, try the following:
cp libbitsandbytes_cuda117.so libbitsandbytes_cpu.so
Pinning bitsandbytes==0.37.2 worked for me as well!