LLaMA-Factory
DBRX using more GPU memory than Mixtral 8x22B for FSDP+QLoRA
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
I'm using the following command, which runs FSDP+QLoRA for Mixtral 8x22B on one 8xA100 (80G) node. But I get a CUDA OOM with the same parameters and the same dataset for dbrx_instruct. DBRX seems to be smaller than Mixtral 8x22B, so how can this happen?
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ../../accelerate/fsdp_config.yaml \
    ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --do_eval \
    --model_name_or_path model_path \
    --dataset sample_dataset \
    --dataset_dir data \
    --template mistral(dbrx) \
    --finetuning_type lora \
    --lora_target all \
    --output_dir output_dir \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 16000 \
    --preprocessing_num_workers 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --warmup_steps 1 \
    --save_steps 100 \
    --eval_steps 1 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    --max_samples 1000 \
    --val_size 0.2 \
    --quantization_bit 4 \
    --fp16
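For context, here is a minimal, hedged Python sketch of how I understand a 4-bit model is usually prepared for FSDP+QLoRA with transformers/bitsandbytes. This is not LLaMA-Factory's actual loading code, and the model id is just a placeholder for `model_name_or_path`. My assumption is that `bnb_4bit_quant_storage` matters here: if it does not match the fp16 training dtype, FSDP may not be able to flatten and shard the 4-bit weights together with the rest of the model, so each rank could hold far more than 1/8 of the parameters.

```python
# Hedged sketch, not LLaMA-Factory's loader; model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
    # Assumption: aligning the storage dtype with fp16 is what allows FSDP
    # to wrap and shard the quantized weights like ordinary parameters.
    bnb_4bit_quant_storage=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "databricks/dbrx-instruct",  # placeholder for model_path above
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
)
```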
Expected behavior
No response
System Info
No response
Others
No response
I also tried plain LoRA for DBRX on one 8xA100 (80G) node and it works. So FSDP+QLoRA uses more GPU memory than LoRA, which is not expected. Maybe there is a bug in the FSDP setup? To compare the two runs concretely, I can log per-rank peak memory with something like the sketch below.
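This is just an illustrative diagnostic (the helper name and where it is called are my own choice, not part of LLaMA-Factory): calling it after the first optimizer step in both the LoRA run and the FSDP+QLoRA run should show where the extra memory goes instead of only ending in an OOM.

```python
# Illustrative per-rank peak-memory logger; not part of LLaMA-Factory.
import torch
import torch.distributed as dist


def log_peak_memory(tag: str) -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    device = torch.cuda.current_device()
    peak_gib = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    print(f"[{tag}] rank {rank}: peak allocated {peak_gib:.1f} GiB")
    # Reset so the next call measures only the following steps.
    torch.cuda.reset_peak_memory_stats(device)


# Example usage: log_peak_memory("after step 1") in both runs and compare.
```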