Four 80GB A100s don't seem to be enough to train the 7B BLOOM with LoRA at a per-device batch size of 4, while ColossalAI can handle it. I'm confused; when I compared the two, the batch size here can only be set to 1.
#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
# Note that usually LoRA needs to use larger learning rate
OUTPUT_PATH=/mnt/bn/simple-nas/mlx/users/zhangyawei.ywsq/playground/arnold_ywsq/DeepSpeedExamples/applications/DeepSpeed-Chat/save/actor-models/7b1_bloom_lora
mkdir -p $OUTPUT_PATH
deepspeed --master_port 25104 --num_gpus 4 main.py \
   --data_path xxx \
   --data_split 10,0,0 \
   --model_name_or_path xxx \
   --per_device_train_batch_size 4 \
   --per_device_eval_batch_size 4 \
   --max_seq_len 2048 \
   --learning_rate 1e-3 \
   --weight_decay 0.1 \
   --num_train_epochs 3 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage 0 \
   --lora_dim 128 \
   --lora_module_name transformer.h. \
   --only_optimize_lora \
   --deepspeed \
   --output_dir $OUTPUT_PATH \
   &> $OUTPUT_PATH/training.log
The BLOOMZ vocabulary is very large, so the embedding matrix and the output logits alone account for a significant share of the memory.
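To put the vocabulary size in perspective, a back-of-the-envelope estimate (the vocab_size and hidden_size below are assumptions taken from the public bloom-7b1 config, not from this thread; verify against your checkpoint's config.json):

# Rough memory estimate for the vocabulary-dependent pieces of bloom-7b1.
vocab_size = 250_880   # assumed from the published BLOOM config
hidden_size = 4096     # assumed from the published bloom-7b1 config
batch_size, seq_len = 4, 2048  # matches the launch command above

embed_params = vocab_size * hidden_size
print(f"embedding parameters: {embed_params / 1e9:.2f}B "
      f"(~{embed_params * 2 / 2**30:.1f} GiB in fp16, replicated per GPU at zero_stage 0)")

# The output logits also scale with the vocabulary size:
logits_elems = batch_size * seq_len * vocab_size
print(f"logits tensor: ~{logits_elems * 2 / 2**30:.1f} GiB in fp16 per device, "
      f"more if the loss is computed in fp32")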
Enabling gradient checkpointing can save a lot of memory
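If your copy of the step-1 main.py exposes a gradient checkpointing switch (recent DeepSpeed-Chat versions have a --gradient_checkpointing flag; check yours), it can simply be added to the launch command above. At the model level, the equivalent Hugging Face call is roughly the following sketch (the model path is an assumption, not taken from this thread):

from transformers import AutoModelForCausalLM

# Assumed checkpoint; replace with your local bloom-7b1 path.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1")

# Trades extra forward recomputation for a large cut in activation memory.
model.gradient_checkpointing_enable()
# The decoder KV cache is not used during training and conflicts with checkpointing.
model.config.use_cache = False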
@yaozhewei It seems the framework currently doesn't allow gradient checkpointing and only_optimize_lora to be used at the same time.
Yes, gradient_checkpointing can save more memory. The non-LoRA part is not huge, since we freeze all Linear.weight matrices.
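For what it's worth, a workaround commonly used elsewhere when a fully frozen base model is combined with gradient checkpointing is to force the embedding output to require gradients, so a gradient path survives through the checkpointed blocks. This is a sketch assuming a stock Hugging Face BLOOM checkpoint; whether it drops cleanly into DeepSpeed-Chat's main.py is untested:

from transformers import AutoModelForCausalLM

# Assumed checkpoint; in DeepSpeed-Chat the model is created inside main.py instead.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1")

if hasattr(model, "enable_input_require_grads"):
    # Available on PreTrainedModel in recent transformers releases.
    model.enable_input_require_grads()
else:
    # Fallback: hook the embedding layer so its output requires grad.
    def make_inputs_require_grad(module, inputs, output):
        output.requires_grad_(True)
    model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

model.gradient_checkpointing_enable()
model.config.use_cache = False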