
CUDA out of memory

Open · nkuacac opened this issue 1 year ago · 1 comment

env

(environment details were posted as a screenshot and are not reproduced here)

command

CUDA_VISIBLE_DEVICES=0 nohup python finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path '/trans_chinese_alpaca_data.json' --output_dir '/lora-alpaca-zh' > output.log &

params

bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: /trans_chinese_alpaca_data.json
output_dir: /lora-alpaca-zh
batch_size: 128
micro_batch_size: 4
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
add_eos_token: False
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca

error

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 14.76 GiB total capacity; 13.46 GiB already allocated; 11.75 MiB free; 14.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
 17%|█▋        | 200/1164 [5:06:26<24:37:04, 91.93s/it]
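The allocator hint at the end of that message maps to the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch of applying it to the same command (the 128 MiB split size is an illustrative value, not a tested recommendation):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 CUDA_VISIBLE_DEVICES=0 python finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path '/trans_chinese_alpaca_data.json' --output_dir '/lora-alpaca-zh'

Note that in this trace 13.46 GiB of the 14.76 GiB card is already allocated, so a fragmentation setting alone may not be enough; reducing actual memory use (see the sketches further down) is usually also needed.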

stack

{'eval_loss': 1.1247862577438354, 'eval_runtime': 327.6669, 'eval_samples_per_second': 6.104, 'eval_steps_per_second': 0.763, 'epoch': 0.52}
Traceback (most recent call last):
  File "/workspace/alpaca-lora/finetune.py", line 283, in <module>
    fire.Fire(train)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/alpaca-lora/finetune.py", line 273, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2006, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2291, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2348, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2830, in save_model
    self._save(output_dir)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2873, in _save
    state_dict = self.model.state_dict()
  File "/workspace/alpaca-lora/finetune.py", line 266, in <lambda>
    self, old_state_dict()
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 100, in undo_layout
    return outputs.reshape(rows, cols).contiguous()

expect

A config suggestion for GPUs with different memory sizes.

nkuacac · Apr 20 '23, 23:04
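For reference, on a card of this size (~14.76 GiB) the usual levers are the flags already listed under params. A hedged sketch with illustrative values that have not been verified on this setup:

CUDA_VISIBLE_DEVICES=0 python finetune.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --data_path '/trans_chinese_alpaca_data.json' \
    --output_dir '/lora-alpaca-zh' \
    --micro_batch_size 2 \
    --cutoff_len 128

Lowering micro_batch_size and cutoff_len mainly cuts per-step activation memory (fewer, shorter sequences per forward pass); the script typically derives its gradient accumulation steps from batch_size / micro_batch_size, so the effective batch size of 128 is preserved.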

Try WORLD_SIZE=1 CUDA_VISIBLE_DEVICES=0,1 python finetune.py to use both GPUs with the default finetune.py, or use DeepSpeed.

lywinged · Apr 21 '23, 15:04
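Spelled out with the flags from this issue, that suggestion would look roughly like the following. This is a sketch, assuming the script loads the base model with device_map='auto' when WORLD_SIZE=1, which typically shards the weights across the visible GPUs and roughly halves the per-GPU footprint:

WORLD_SIZE=1 CUDA_VISIBLE_DEVICES=0,1 python finetune.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --data_path '/trans_chinese_alpaca_data.json' \
    --output_dir '/lora-alpaca-zh'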