Cannot apply both PEFT QLoRA and DeepSpeed ZeRO3
System Info
- `Accelerate` version: 0.33.0
- Platform: Linux-5.15.133+-x86_64-with-glibc2.35
- `accelerate` bash location: /opt/conda/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.25.2
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1842.60 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: NO
  - mixed_precision: no
  - use_cpu: True
  - debug: False
  - num_processes: 1
  - machine_rank: 0
  - num_machines: 1
  - rdzv_backend: static
  - same_network: False
  - main_training_function: main
  - enable_cpu_affinity: False
  - downcast_bf16: False
  - tpu_use_cluster: False
  - tpu_use_sudo: False
Who can help?
@stevhliu
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder
- [ ] My own task or dataset (give details below)
Reproduction
The line `model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct", ...)` throws CUDA OOM because the parameters are not partitioned but copied to every GPU.
command

```bash
accelerate launch --config_file zero3_config.yaml pretrain.py --num_processes=8 --multi_gpu
```
pretrain.py
```python
...
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-405B-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
...
```
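The PEFT/LoRA step implied by the issue title is elided above. A purely illustrative sketch of what typically follows a 4-bit load for QLoRA is shown below; the ranks and target modules are assumptions, not taken from the original script.

```python
# Illustrative only: LoRA wrapping that typically follows the 4-bit load for QLoRA.
# Requires `peft`; r, lora_alpha, and target_modules are placeholder values.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```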
zero3_config.yaml
```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
Expected behavior
PEFT QLoRA (with bitsandbytes) and DeepSpeed ZeRO3 should both be applied, so that the model parameters are both quantized and partitioned. I thought this was supported according to this post, but https://github.com/microsoft/DeepSpeed/issues/5819 says bitsandbytes quantization and ZeRO3 are not compatible. If that is the case, I find the above post quite misleading.
@echo-yi Does it work for you with a smaller model, like the example from the PEFT docs?
@matthewdouglas @Titus-von-Koeller Could you please take a look? Could this be an issue specific to Llama 405B?
More context: https://github.com/huggingface/transformers/pull/29587 ("Disable zero.init when using DeepSpeed with QLoRA").
I wonder if this is still needed?
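A rough way to check whether zero.init actually partitioned the weights is sketched below; it assumes transformers' DeepSpeed integration is active and that `model` has already been loaded.

```python
# Diagnostic sketch: was ZeRO-3 zero.init active, and were parameters partitioned?
# Assumes transformers' DeepSpeed integration and an already loaded `model`.
from transformers.integrations import is_deepspeed_zero3_enabled

print("ZeRO-3 enabled at load time:", is_deepspeed_zero3_enabled())

# Parameters partitioned by DeepSpeed's zero.Init carry ds_* attributes and keep
# only an empty local placeholder; replicated parameters retain their full shape.
for name, param in list(model.named_parameters())[:5]:
    status = "partitioned" if hasattr(param, "ds_id") else "replicated"
    print(name, status, tuple(param.shape))
```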
@BenjaminBossan I tried with `meta-llama/Meta-Llama-3.1-8B-Instruct` and `meta-llama/Meta-Llama-3.1-70B-Instruct`, and neither worked.
Thanks for testing those. Since this error already occurs at the stage of loading the base model, it is not directly a PEFT error, though of course PEFT is affected, and I'd be ready to update the docs if it is confirmed that DeepSpeed ZeRO3 doesn't work with bnb. I hope the bnb authors can enlighten us.
@tjruwase from DeepSpeed shared this line, indicating that applying both quantization and DeepSpeed ZeRO3 doesn't work.
> shared this line, indicating applying both quantization and DS ZeRO3 doesn't work
Yeah, that was added in the PR I mentioned earlier.
I can confirm that even for smaller models, partitioning does not appear to work. But when I remove quantization and use `device_map="auto"`, the same picture emerges. So I'm actually unsure whether there is an issue here with bitsandbytes usage under DeepSpeed ZeRO3 or whether something else is amiss.
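One rough way to compare the two setups, assuming a multi-GPU CUDA run started with `accelerate launch`, is to log per-rank allocated memory right after loading:

```python
# Sketch: print per-rank GPU memory right after model loading. With working
# ZeRO-3 partitioning each rank should hold roughly model_size / num_gpus;
# with replication every rank holds the full model.
import torch
import torch.distributed as dist

rank = dist.get_rank() if dist.is_initialized() else 0
alloc_gb = torch.cuda.memory_allocated() / 1e9
print(f"rank {rank} (cuda:{torch.cuda.current_device()}): {alloc_gb:.1f} GB allocated")
```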
@BenjaminBossan When I remove ZeRO3 and use quantization with `device_map="auto"`, partitioning does appear to work.
Also pinging @muellerzr in case he knows something about this.
I tested in axolotl against the latest transformers, and this seems to work with the following QLoRA + PEFT + ZeRO3 YAML:
```yaml
base_model: NousResearch/Meta-Llama-3-8B
load_in_4bit: true
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
val_set_size: 0.0
output_dir: ./outputs/lora-out
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
bf16: auto
tf32: false
gradient_checkpointing: true
logging_steps: 1
flash_attention: true
warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
deepspeed: deepspeed_configs/zero3_bf16.json
weight_decay: 0.1
special_tokens:
  pad_token: <|end_of_text|>
```
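For reference, a config like this is typically launched with something along the lines of `accelerate launch -m axolotl.cli.train qlora-zero3.yaml`; the config filename here is illustrative and was not stated in the thread.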
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.