
LLaMA-7B + LoRA OOM on a 16GB V100

Open zhenqincn opened this issue 1 year ago • 2 comments

Dear authors, following the configuration in this repository, I set both `per_device_train_batch_size` and `per_device_eval_batch_size` to 1, but running `lomo_lora_trainer.py` to train LLaMA-7B on a single 16GB V100 still runs out of memory (OOM).

The full configuration is as follows:

# model
model_name_or_path: 'openlm-research/open_llama_7b'
# data
dataset_name: 'wic'
refresh: false
data_tag: 'base'
train_on_inputs: false
data_max_length: 1024
# training
# trainer
peft_type: 'lora'
lora_only: false
hf_learning_rate: 0.0005
hf_weight_decay: 0
hf_lr_scheduler_type: 'linear'
hf_warmup: 0.05
tag: 'lora-qv-r2-lomo'
output_dir: 'outputs'
overwrite_output_dir: true
deepspeed: 'config/ds_config_lora.json'
do_train: true
do_eval: true
evaluation_strategy: 'epoch'
per_device_train_batch_size: 1
per_device_eval_batch_size: 1
learning_rate: 0.005
weight_decay: 0
num_train_epochs: 10
lr_scheduler_type: 'linear'
warmup: 0.05
clip_grad_norm: 1.0
#clip_grad_value: 1.0
#clip_loss_value: 5.0
log_level: 'info'
logging_steps: 1
# please set `resume_from_checkpoint` to load checkpoints. check `merge_llama_with_lora.py` first.
#resume_from_checkpoint: 'outputs/wic_7B_lora-qv-r2-lomo/output_lr0.005_bs16_warmup0.05_clipnorm1.0/checkpoint-0/merge_weights'
# please set `save_strategy` (`no`, `epoch`, `steps`) and `save_total_limit` (the max amount of checkpoints) to save checkpoints.
save_strategy: 'no'
save_total_limit: 0
seed: 42
#bf16: true
remove_unused_columns: false
load_best_model_at_end: false
metric_for_best_model: 'acc'
optim: 'sgd'
group_by_length: false
#report_to: 'wandb'
dataloader_pin_memory: false
gradient_checkpointing: true
predict_with_generate: false
lora_r: 2
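
For reference, a minimal sketch of what the LoRA part of the configuration above typically maps to with the Hugging Face peft library; this is not the repository's own code, and the `target_modules` names (`q_proj`, `v_proj`) are an assumption based on the `lora-qv-r2` tag.

```python
# Minimal sketch (not the repository's code): a LoRA adapter with r=2 on the
# query/value projections, declared with Hugging Face peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_7b")

lora_config = LoraConfig(
    r=2,                                  # matches `lora_r: 2`
    lora_alpha=16,                        # assumed value, not taken from the config above
    target_modules=["q_proj", "v_proj"],  # assumed from the "qv" in the run tag
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices should be trainable
```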

By the way, with the same configuration as above but without LoRA, training LLaMA-7B with LOMO on a 16GB V100 uses 15933 MB of GPU memory, which does not seem to match the results reported in the paper. Is there something wrong with my configuration?

zhenqincn · Aug 30 '23 02:08

Hi, when I benchmarked the memory usage of LOMO + LoRA I used a 3090 with 24GB of memory, so a single card was enough; a V100 may need two cards. The memory numbers in the paper were measured with `torch.cuda.memory_reserved()`, which reports slightly less than monitoring tools such as nvidia-smi, so that difference is normal.

KaiLv69 · Aug 31 '23 13:08
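
To illustrate the difference described above, here is a small sketch of how the two numbers can be compared from inside the training process. It only uses standard PyTorch CUDA statistics; nvidia-smi additionally counts the CUDA context and memory held outside PyTorch's caching allocator, so it is expected to read higher than `torch.cuda.memory_reserved()`.

```python
# Sketch: comparing PyTorch allocator statistics with an nvidia-smi-style view.
# Run inside the training process after a few steps.
import torch

gib = 1024 ** 3
print(f"allocated:    {torch.cuda.memory_allocated() / gib:.2f} GiB")    # tensors currently in use
print(f"reserved:     {torch.cuda.memory_reserved() / gib:.2f} GiB")     # allocator pool (the number used in the paper)
print(f"max reserved: {torch.cuda.max_memory_reserved() / gib:.2f} GiB")

# nvidia-smi-style view: total minus free on the current device; this also
# includes the CUDA context, so it will be somewhat larger than `reserved`.
free, total = torch.cuda.mem_get_info()
print(f"device used:  {(total - free) / gib:.2f} GiB of {total / gib:.2f} GiB")
```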

Thank you very much for the explanation. Table 2 of your paper reports that training a 7B model with LOMO on a single 3090 uses 13.61 GB of GPU memory. Adding LoRA (r=2) should not push that over the limit of a 16GB card, so may I ask whether this OOM is expected?

zhenqincn · Sep 01 '23 08:09
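
As a rough back-of-the-envelope check of the reasoning in the last comment, assuming LLaMA-7B dimensions (32 decoder layers, hidden size 4096) and adapters only on the query/value projections, the LoRA weights themselves add only a couple of MiB; any OOM is therefore dominated by activations, optimizer/allocator overhead, or fragmentation rather than the adapters.

```python
# Back-of-the-envelope estimate of the extra memory LoRA r=2 adds.
# Assumptions (not from the issue): LLaMA-7B has 32 decoder layers with
# hidden size 4096, and adapters are applied only to q_proj and v_proj.
layers, hidden, r = 32, 4096, 2
projections_per_layer = 2                         # q_proj and v_proj
params_per_projection = r * hidden + hidden * r   # A (r x hidden) + B (hidden x r)
lora_params = layers * projections_per_layer * params_per_projection

print(f"extra LoRA parameters: {lora_params / 1e6:.2f} M")               # ~1.05 M
print(f"extra weight memory (fp16): {lora_params * 2 / 2**20:.1f} MiB")  # ~2 MiB
```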