Four 80GB A100s don't seem to be enough to train the 7B BLOOM with LoRA at a per-device batch size of 4, while ColossalAI can handle it. I'm confused; when I compared the two, the batch size here can only be set to 1.
#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
# Note that usually LoRA needs to use larger learning rate
OUTPUT_PATH=/mnt/bn/simple-nas/mlx/users/zhangyawei.ywsq/playground/arnold_ywsq/DeepSpeedExamples/applications/DeepSpeed-Chat/save/actor-models/7b1_bloom_lora
mkdir -p $OUTPUT_PATH
deepspeed --master_port 25104 --num_gpus 4 main.py \
   --data_path xxx \
   --data_split 10,0,0 \
   --model_name_or_path xxx \
   --per_device_train_batch_size 4 \
   --per_device_eval_batch_size 4 \
   --max_seq_len 2048 \
   --learning_rate 1e-3 \
   --weight_decay 0.1 \
   --num_train_epochs 3 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage 0 \
   --lora_dim 128 \
   --lora_module_name transformer.h. \
   --only_optimize_lora \
   --deepspeed \
   --output_dir $OUTPUT_PATH \
   &> $OUTPUT_PATH/training.log
The BLOOMZ vocabulary is very large, so the embedding matrix and the output logits alone account for a significant share of the memory.
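To put the vocabulary size in perspective, a back-of-the-envelope estimate (the vocab_size and hidden_size below are assumptions taken from the public bloom-7b1 config, not from this thread; verify against your checkpoint's config.json):

# Rough memory estimate for the vocabulary-dependent pieces of bloom-7b1.
vocab_size = 250_880   # assumed from the published BLOOM config
hidden_size = 4096     # assumed from the published bloom-7b1 config
batch_size, seq_len = 4, 2048  # matches the launch command above

embed_params = vocab_size * hidden_size
print(f"embedding parameters: {embed_params / 1e9:.2f}B "
      f"(~{embed_params * 2 / 2**30:.1f} GiB in fp16, replicated per GPU at zero_stage 0)")

# The output logits also scale with the vocabulary size:
logits_elems = batch_size * seq_len * vocab_size
print(f"logits tensor: ~{logits_elems * 2 / 2**30:.1f} GiB in fp16 per device, "
      f"more if the loss is computed in fp32")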
Enabling gradient checkpointing can save a lot of memory
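If your copy of the step-1 main.py exposes a gradient checkpointing switch (recent DeepSpeed-Chat versions have a --gradient_checkpointing flag; check yours), it can simply be added to the launch command above. At the model level, the equivalent Hugging Face call is roughly the following sketch (the model path is an assumption, not taken from this thread):

from transformers import AutoModelForCausalLM

# Assumed checkpoint; replace with your local bloom-7b1 path.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1")

# Trades extra forward recomputation for a large cut in activation memory.
model.gradient_checkpointing_enable()
# The decoder KV cache is not used during training and conflicts with checkpointing.
model.config.use_cache = False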
@yaozhewei It seems the framework currently doesn't allow gradient checkpointing and only_optimize_lora to be used at the same time.
Yes, gradient_checkpointing can save more memory. The non-LoRA part is not huge, since we freeze all Linear.weight matrices.
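For what it's worth, a workaround commonly used elsewhere when a fully frozen base model is combined with gradient checkpointing is to force the embedding output to require gradients, so a gradient path survives through the checkpointed blocks. This is a sketch assuming a stock Hugging Face BLOOM checkpoint; whether it drops cleanly into DeepSpeed-Chat's main.py is untested:

from transformers import AutoModelForCausalLM

# Assumed checkpoint; in DeepSpeed-Chat the model is created inside main.py instead.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1")

if hasattr(model, "enable_input_require_grads"):
    # Available on PreTrainedModel in recent transformers releases.
    model.enable_input_require_grads()
else:
    # Fallback: hook the embedding layer so its output requires grad.
    def make_inputs_require_grad(module, inputs, output):
        output.requires_grad_(True)
    model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

model.gradient_checkpointing_enable()
model.config.use_cache = False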