[BUG]: Coati LoRA incompatible with Gemini & HybridParallel(pp=1), but runs well with HybridParallel(pp>=2)
🐛 Describe the bug
Description
I applied Coati LoRA before parallel fine-tuning of LLaMA-7B, and found:
- Gemini runs into `Error(s) in loading state_dict for GeminiCheckpointIO:`, and the trainable params remain at 6.32 B
- HybridParallel(pp=1) runs into `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`, although the trainable params are set correctly and equal 38.68 M
- HybridParallel(pp=2) ran successfully and the trainable params are divided properly, 19.06 M on the master GPU
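For reference, the trainable-parameter figures above correspond to summing the parameters with `requires_grad=True` after the model has been wrapped. A minimal sketch of such a check (the helper `count_params` is hypothetical, not from my script):

```python
import torch.nn as nn

def count_params(model: nn.Module):
    """Count trainable vs. total parameters, e.g. to check whether the
    LoRA freezing survived the plugin's model wrapping."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total
```

With LoRA applied correctly, only the low-rank adapters should remain trainable, so the 6.32 B figure under Gemini suggests the `requires_grad` flags of the frozen base weights are lost there.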
Considering the efficiency and stability of fine-tuning large models, and the ability to support longer seq_len and larger batch_size, I'm sincerely looking forward to an upcoming update that fully supports LoRA in distributed training/fine-tuning.
To Reproduce
- environment: CUDA=11.7, torch=2.1.2+cu118, 2*A100-40G, NCCL backend, python=3.10.14, Ubuntu 20.04
- requirements: colossalai=0.3.5, loralib=0.1.2, transformers=4.33. I'm also using flash-attn=2.5.6 & dropout-layer-norm=0.1 (a submodule of flash-attn), with a few modifications to shardformer/modeling/llama.py to implement flash attention for the HybridParallel plugin.
- modifications:
  - `finetune.py`, under the model loading part:

    ```python
    with init_ctx:
        model = LlamaForCausalLM(config)
        if args.lora:
            from coati_lora import convert_to_lora_module  # coati_lora is the lora.py copied from Coati
            model = convert_to_lora_module(model, 16)
    ```
  - `finetune.py`, under the arg parser (the plugin built here is handed to the `Booster`; see the sketch after this list):

    ```python
    parser.add_argument("--lora", action="store_true")
    parser.add_argument("--ppsize", default=2, type=int)
    parser.add_argument("--tpsize", default=4, type=int)

    # Gemini is left unchanged, but HybridParallel had modifications.
    if args.plugin == "hybrid_parallel":
        # Modify the params accordingly; the default configuration is for llama2-7b.
        # The pptp_size below is a parameter to control DataParallel and does not matter here.
        args.pptp_size = args.ppsize * args.tpsize
        plugin = HybridParallelPlugin(
            tp_size=args.tpsize,
            pp_size=args.ppsize,
            num_microbatches=2,
            microbatch_size=None,
            enable_jit_fused=False,
            zero_stage=0,
            precision="bf16",
            initial_scale=1,
        )
    ```
  - `finetune.sh`, rewritten as another version:

    ```bash
    MODEL_NAME="deepseek-coder-6.7b-instruct"
    DATASET_PATH=""
    SAVE_DIR="save_checkpoint/$MODEL_NAME"

    # LoRA
    # Notice that I did not use DataParallel here
    CUDA_VISIBLE_DEVICES=3,5 CUDA_LAUNCH_BLOCKING=1 \
    nohup colossalai run --nproc_per_node 2 --master_port 29503 \
        col_train.py --plugin "gemini" \
        --model_path "./model/$MODEL_NAME" --dataset "$DATASET_PATH" \
        --save_dir $SAVE_DIR --save_interval 5000 \
        --lr 0.00005 --lora --batch_size 2 --max_length 2048 --ppsize 2 --tpsize 1 \
        --mixed_precision bf16 --flash_attention \
        --tensorboard_dir "log/train/tb_logs" \
        > log/train/[$$]${MODEL_NAME}.log &
    ```
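For context on how the plugin and the LoRA-converted model are combined, here is a minimal sketch of the boost step, assuming my script follows the standard ColossalAI `Booster` pattern (the variable names are illustrative, not copied from `col_train.py`):

```python
from colossalai.booster import Booster

# Hand the chosen plugin (Gemini or HybridParallel) to the Booster, then wrap
# the LoRA-converted model together with optimizer, criterion and dataloader.
booster = Booster(plugin=plugin)
model, optimizer, criterion, dataloader, lr_scheduler = booster.boost(
    model, optimizer, criterion, dataloader, lr_scheduler
)
```

The Gemini `state_dict` error presumably surfaces around the checkpoint I/O on this wrapped model, while the HybridParallel(pp=1) `grad_fn` error shows up at the first backward pass.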
Expected Behavior
Considering the efficiency and stability of fine-tuning large models, and the ability to support longer seq_len and larger batch_size, I'm sincerely looking forward to an upcoming update that fully supports LoRA in distributed training/fine-tuning. Specifically, I request:
- Making Coati LoRA compatible with the HybridParallel plugin when pp_size=1
- Making Coati LoRA compatible with the Gemini plugin
- Further supporting PEFT in distributed training/fine-tuning, making it compatible with Gemini, HybridParallel and even flash-attn
Screenshots
- Gemini plugin failure
- HybridParallel (pp=1, tp=2) failure
- HybridParallel (pp=2, tp=1) success
Environment
```text
CUDA                11.7
accelerate          0.28.0
colossalai          0.3.5
datasets            2.18.0
dropout-layer-norm  0.1
flash-attn          2.5.6
loralib             0.1.2
ninja               1.11.1.1
numpy               1.26.4
packaging           23.2
peft                0.10.0
ray                 2.10.0
safetensors         0.4.2
scipy               1.12.0
sentencepiece       0.2.0
tokenizers          0.13.3
torch               2.1.2
tqdm                4.66.2
transformers        4.33.0
triton              2.1.0
xformers            0.0.23.post1
```
It turns out to be a problem with tensor parallelism in the hybrid_parallel plugin, where the LoRA parameters are ignored when building the column & row parallel layers.
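To illustrate what that means, here is a simplified sketch, not Coati's actual `LoraLinear` implementation nor shardformer's real column/row-parallel builders; the class `LoRALinear` and the function `shard_linear_naively` are hypothetical stand-ins:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Simplified Coati-style LoRA wrapper: frozen base weight + trainable A/B."""
    def __init__(self, base: nn.Linear, r: int = 16):
        super().__init__()
        self.weight = base.weight              # full-rank base weight, frozen
        self.weight.requires_grad = False
        self.lora_A = nn.Parameter(torch.zeros(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        # effective weight = frozen base + low-rank update
        return x @ (self.weight + self.lora_B @ self.lora_A).t()

def shard_linear_naively(module: LoRALinear, tp_size: int, rank: int) -> nn.Linear:
    """Hypothetical column-parallel rewrite that only looks at .weight, which is
    what the current sharding path appears to do: lora_A/lora_B are dropped."""
    out_per_rank = module.weight.shape[0] // tp_size
    new_linear = nn.Linear(module.weight.shape[1], out_per_rank, bias=False)
    with torch.no_grad():
        new_linear.weight.copy_(module.weight[rank * out_per_rank:(rank + 1) * out_per_rank])
    new_linear.weight.requires_grad = False    # base weight stays frozen
    return new_linear                          # trainable lora_A / lora_B are gone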