[BUG]: Coati LoRA incompatible with Gemini & HybridParallel(pp=1), but runs well with HybridParallel(pp>=2)
🐛 Describe the bug
Description
I applied Coati LoRA before parallel fine-tuning of LLaMA-7B, and found:
- Gemini runs into `Error(s) in loading state_dict for GeminiCheckpointIO:`, and the trainable params remain at 6.32 B
- HybridParallel(pp=1) runs into `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`, although the trainable params are set correctly and equal 38.68 M
- HybridParallel(pp=2) ran successfully and the trainable params are divided properly, 19.06 M on the master GPU
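For reference, the trainable-parameter figures above correspond to summing the parameters with `requires_grad=True` after the model has been wrapped. A minimal sketch of such a check (the helper `count_params` is hypothetical, not from my script):

```python
import torch.nn as nn

def count_params(model: nn.Module):
    """Count trainable vs. total parameters, e.g. to check whether the
    LoRA freezing survived the plugin's model wrapping."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total
```

With LoRA applied correctly, only the low-rank adapters should remain trainable, so the 6.32 B figure under Gemini suggests the `requires_grad` flags of the frozen base weights are lost there.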
Considering the efficiency and stability of fine-tuning large models, and the ability to support longer seq_len and larger batch_size, I'm sincerely looking forward to an upcoming update that fully supports LoRA in distributed training/fine-tuning.
To Reproduce
- environment: CUDA=11.7, torch=2.1.2+cu118, 2*A100-40G, NCCL backend, python=3.10.14, Ubuntu 20.04
- requirements: colossalai=0.3.5, loralib=0.1.2, transformers=4.33. I'm also using flash-attn=2.5.6 & dropout-layer-norm=0.1 (a submodule of flash-attn), with a few modifications to shardformer/modeling/llama.py to implement flash attention for the HybridParallel plugin.
- modifications:
  - `finetune.py`, under the model loading part:

    ```python
    with init_ctx:
        model = LlamaForCausalLM(config)
        if args.lora:
            from coati_lora import convert_to_lora_module  # coati_lora is the lora.py copied from Coati
            model = convert_to_lora_module(model, 16)
    ```
  - `finetune.py`, under the arg parser (the plugin built here is handed to the `Booster`; see the sketch after this list):

    ```python
    parser.add_argument("--lora", action="store_true")
    parser.add_argument("--ppsize", default=2, type=int)
    parser.add_argument("--tpsize", default=4, type=int)

    # Gemini is left unchanged, but HybridParallel had modifications.
    if args.plugin == "hybrid_parallel":
        # Modify the params accordingly; the default configuration is for llama2-7b.
        # The pptp_size below is a parameter to control DataParallel and does not matter here.
        args.pptp_size = args.ppsize * args.tpsize
        plugin = HybridParallelPlugin(
            tp_size=args.tpsize,
            pp_size=args.ppsize,
            num_microbatches=2,
            microbatch_size=None,
            enable_jit_fused=False,
            zero_stage=0,
            precision="bf16",
            initial_scale=1,
        )
    ```
  - `finetune.sh`, rewritten as another version:

    ```bash
    MODEL_NAME="deepseek-coder-6.7b-instruct"
    DATASET_PATH=""
    SAVE_DIR="save_checkpoint/$MODEL_NAME"

    # LoRA
    # Notice that I did not use DataParallel here
    CUDA_VISIBLE_DEVICES=3,5 CUDA_LAUNCH_BLOCKING=1 \
    nohup colossalai run --nproc_per_node 2 --master_port 29503 \
        col_train.py --plugin "gemini" \
        --model_path "./model/$MODEL_NAME" --dataset "$DATASET_PATH" \
        --save_dir $SAVE_DIR --save_interval 5000 \
        --lr 0.00005 --lora --batch_size 2 --max_length 2048 --ppsize 2 --tpsize 1 \
        --mixed_precision bf16 --flash_attention \
        --tensorboard_dir "log/train/tb_logs" \
        > log/train/[$$]${MODEL_NAME}.log &
    ```
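For context on how the plugin and the LoRA-converted model are combined, here is a minimal sketch of the boost step, assuming my script follows the standard ColossalAI `Booster` pattern (the variable names are illustrative, not copied from `col_train.py`):

```python
from colossalai.booster import Booster

# Hand the chosen plugin (Gemini or HybridParallel) to the Booster, then wrap
# the LoRA-converted model together with optimizer, criterion and dataloader.
booster = Booster(plugin=plugin)
model, optimizer, criterion, dataloader, lr_scheduler = booster.boost(
    model, optimizer, criterion, dataloader, lr_scheduler
)
```

The Gemini `state_dict` error presumably surfaces around the checkpoint I/O on this wrapped model, while the HybridParallel(pp=1) `grad_fn` error shows up at the first backward pass.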
Expected Behavior
Considering the efficiency and stability of fine-tuning large models, and the ability to support longer seq_len and larger batch_size, I'm sincerely looking forward to an upcoming update that fully supports LoRA in distributed training/fine-tuning. Specifically, I request:
- Making Coati LoRA compatible with the HybridParallel plugin when pp_size=1
- Making Coati LoRA compatible with the Gemini plugin
- Further supporting PEFT in distributed training/fine-tuning, making it compatible with Gemini, HybridParallel and even flash-attn
Screenshots
- Gemini plugin failure
- HybridParallel (pp=1, tp=2) failure
- HybridParallel (pp=2, tp=1) success
Environment
```text
CUDA                11.7
accelerate          0.28.0
colossalai          0.3.5
datasets            2.18.0
dropout-layer-norm  0.1
flash-attn          2.5.6
loralib             0.1.2
ninja               1.11.1.1
numpy               1.26.4
packaging           23.2
peft                0.10.0
ray                 2.10.0
safetensors         0.4.2
scipy               1.12.0
sentencepiece       0.2.0
tokenizers          0.13.3
torch               2.1.2
tqdm                4.66.2
transformers        4.33.0
triton              2.1.0
xformers            0.0.23.post1
```
It turns out to be a problem with tensor parallelism in the hybrid_parallel plugin, where the LoRA parameters are ignored when building the column & row parallel layers.
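To illustrate what that means, here is a simplified sketch, not Coati's actual `LoraLinear` implementation nor shardformer's real column/row-parallel builders; the class `LoRALinear` and the function `shard_linear_naively` are hypothetical stand-ins:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Simplified Coati-style LoRA wrapper: frozen base weight + trainable A/B."""
    def __init__(self, base: nn.Linear, r: int = 16):
        super().__init__()
        self.weight = base.weight              # full-rank base weight, frozen
        self.weight.requires_grad = False
        self.lora_A = nn.Parameter(torch.zeros(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        # effective weight = frozen base + low-rank update
        return x @ (self.weight + self.lora_B @ self.lora_A).t()

def shard_linear_naively(module: LoRALinear, tp_size: int, rank: int) -> nn.Linear:
    """Hypothetical column-parallel rewrite that only looks at .weight, which is
    what the current sharding path appears to do: lora_A/lora_B are dropped."""
    out_per_rank = module.weight.shape[0] // tp_size
    new_linear = nn.Linear(module.weight.shape[1], out_per_rank, bias=False)
    with torch.no_grad():
        new_linear.weight.copy_(module.weight[rank * out_per_rank:(rank + 1) * out_per_rank])
    new_linear.weight.requires_grad = False    # base weight stays frozen
    return new_linear                          # trainable lora_A / lora_B are gone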