LLaMA-Factory
DBRX using more GPU memory than Mixtral 8x22B for FSDP+QLoRA
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
I'm using the following command, which runs FSDP+QLoRA for Mixtral 8x22B on one 8xA100 (80G) node. But I get a CUDA OOM with the same parameters and the same dataset for dbrx_instruct. DBRX seems to be smaller than Mixtral 8x22B, so how can this happen?
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ../../accelerate/fsdp_config.yaml \
    ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --do_eval \
    --model_name_or_path model_path \
    --dataset sample_dataset \
    --dataset_dir data \
    --template mistral(dbrx) \
    --finetuning_type lora \
    --lora_target all \
    --output_dir output_dir \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 16000 \
    --preprocessing_num_workers 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --warmup_steps 1 \
    --save_steps 100 \
    --eval_steps 1 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    --max_samples 1000 \
    --val_size 0.2 \
    --quantization_bit 4 \
    --fp16
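For context, here is a minimal, hedged Python sketch of how I understand a 4-bit model is usually prepared for FSDP+QLoRA with transformers/bitsandbytes. This is not LLaMA-Factory's actual loading code, and the model id is just a placeholder for `model_name_or_path`. My assumption is that `bnb_4bit_quant_storage` matters here: if it does not match the fp16 training dtype, FSDP may not be able to flatten and shard the 4-bit weights together with the rest of the model, so each rank could hold far more than 1/8 of the parameters.

```python
# Hedged sketch, not LLaMA-Factory's loader; model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
    # Assumption: aligning the storage dtype with fp16 is what allows FSDP
    # to wrap and shard the quantized weights like ordinary parameters.
    bnb_4bit_quant_storage=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "databricks/dbrx-instruct",  # placeholder for model_path above
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
)
```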
Expected behavior
No response
System Info
No response
Others
No response
I also tried plain LoRA for DBRX on one 8xA100 (80G) node and it works. So FSDP+QLoRA uses more GPU memory than LoRA, which is not expected. Maybe there is a bug in the FSDP setup? To compare the two runs concretely, I can log per-rank peak memory with something like the sketch below.
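This is just an illustrative diagnostic (the helper name and where it is called are my own choice, not part of LLaMA-Factory): calling it after the first optimizer step in both the LoRA run and the FSDP+QLoRA run should show where the extra memory goes instead of only ending in an OOM.

```python
# Illustrative per-rank peak-memory logger; not part of LLaMA-Factory.
import torch
import torch.distributed as dist


def log_peak_memory(tag: str) -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    device = torch.cuda.current_device()
    peak_gib = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    print(f"[{tag}] rank {rank}: peak allocated {peak_gib:.1f} GiB")
    # Reset so the next call measures only the following steps.
    torch.cuda.reset_peak_memory_stats(device)


# Example usage: log_peak_memory("after step 1") in both runs and compare.
```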