alpaca-lora
CUDA out of memory
env
command
CUDA_VISIBLE_DEVICES=0 nohup python finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path '/trans_chinese_alpaca_data.json' --output_dir '/lora-alpaca-zh' > output.log &
params
bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: /trans_chinese_alpaca_data.json
output_dir: /lora-alpaca-zh
batch_size: 128
micro_batch_size: 4
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
add_eos_token: False
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca
error
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 14.76 GiB total capacity; 13.46 GiB already allocated; 11.75 MiB free; 14.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
17%|█▋ | 200/1164 [5:06:26<24:37:04, 91.93s/it]
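The message itself suggests trying max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of what I may try next (the 128 MiB split size is only an illustrative value, not something I have verified helps here):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
CUDA_VISIBLE_DEVICES=0 nohup python finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path '/trans_chinese_alpaca_data.json' --output_dir '/lora-alpaca-zh' > output.log &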
stack
17%|█▋ | 200/1164 [5:06:26<24:10:03]
{'eval_loss': 1.1247862577438354, 'eval_runtime': 327.6669, 'eval_samples_per_second': 6.104, 'eval_steps_per_second': 0.763, 'epoch': 0.52}
Traceback (most recent call last):
File "/workspace/alpaca-lora/finetune.py", line 283, in <module>
fire.Fire(train)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/workspace/alpaca-lora/finetune.py", line 273, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2006, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2291, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2348, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2830, in save_model
self._save(output_dir)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2873, in _save
state_dict = self.model.state_dict()
File "/workspace/alpaca-lora/finetune.py", line 266, in <lambda>
self, old_state_dict()
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 100, in undo_layout
return outputs.reshape(rows, cols).contiguous()
expect
A config suggestion for GPUs with different memory sizes.
Should I keep using the default finetune.py (e.g. WORLD_SIZE=1 CUDA_VISIBLE_DEVICES=0,1 python finetune.py), or switch to DeepSpeed?
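For reference, the stack shows the OOM happens while saving the step-200 checkpoint (_save_checkpoint -> state_dict -> bitsandbytes undo_layout), not in a forward/backward pass, so the run needs more free memory at save time than the 11.75 MiB left on this 14.76 GiB card. One thing I could try, assuming fire exposes the params printed above as command-line flags (which the --base_model/--data_path flags suggest): lower micro_batch_size and/or cutoff_len. If gradient accumulation is derived as batch_size / micro_batch_size (128 / 4 = 32 here), a smaller micro_batch_size keeps the same effective batch size while cutting activation memory. An untested sketch:

CUDA_VISIBLE_DEVICES=0 nohup python finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path '/trans_chinese_alpaca_data.json' --output_dir '/lora-alpaca-zh' --micro_batch_size 2 --cutoff_len 128 > output.log &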