memory bug in using accelerate with deepspeed to train diffusion models

Open zhangvia opened this issue 1 year ago • 3 comments

System Info

accelerate: 0.22.0
python: 3.8.18
config yaml:
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: False
  zero3_save_16bit_model: False
  overlap_comm: True
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
[X] My own task or dataset (give details below)

Reproduction

I use the example training script from the diffusers repo (train_text_to_image.py) to fine-tune Stable Diffusion. My training command is:

accelerate launch --config_file ./deepspeed.yaml --mixed_precision="fp16" train_text_to_image.py \
	--pretrained_model_name_or_path=$MODEL_NAME \
	--dataset_name=$dataset_name \
	--resolution=256 --center_crop --random_flip \
	--train_batch_size=1 \
	--gradient_accumulation_steps=4 \
	--max_train_steps=15000 \
	--learning_rate=1e-05 \
	--max_grad_norm=1 \
	--lr_scheduler="constant" --lr_warmup_steps=0 \
	--enable_xformers_memory_efficient_attention \
	--output_dir="sd-pokemon-model"

When I train the model with DeepSpeed ZeRO stage 2, it uses about 7 GB of VRAM per GPU. With stage 3, however, the process uses about 9 GB of VRAM per GPU. Is this a bug in accelerate or in DeepSpeed? In theory, stage 3 should not use more VRAM than stage 2.
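To make the theoretical expectation concrete, here is a rough sketch of the per-GPU model-state memory under each ZeRO stage, using the estimates from the ZeRO paper for mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states). The function name and the parameter count are assumptions made only for illustration, and the estimate ignores activations, communication buffers, and frozen modules such as the VAE and text encoder, so measured usage can deviate from it.

```python
def zero_model_state_bytes(num_params: int, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU bytes for weights, gradients, and Adam states.

    Hypothetical helper for illustration only. Assumes mixed-precision Adam:
    2 B/param fp16 weights, 2 B/param fp16 gradients, and 12 B/param fp32
    optimizer states (master weights, momentum, variance).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params

    if stage >= 1:  # stage 1 partitions optimizer states across GPUs
        optim /= num_gpus
    if stage >= 2:  # stage 2 additionally partitions gradients
        grads /= num_gpus
    if stage >= 3:  # stage 3 additionally partitions the weights themselves
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # Assumption: placeholder count of trainable parameters, not a measured value.
    n_params = 860_000_000
    n_gpus = 6  # matches num_processes in the config above
    for stage in (0, 1, 2, 3):
        gib = zero_model_state_bytes(n_params, n_gpus, stage) / 2**30
        print(f"stage {stage}: ~{gib:.2f} GiB of model states per GPU")
```

Under these assumptions the stage 3 estimate is always at or below the stage 2 estimate, which is why the measured 9 GB versus 7 GB looks unexpected.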

Expected behavior

How can I use stage 3 to reduce memory consumption?

Feb 21 '24 07:02 zhangvia

cc @pacman100

Feb 23 '24 17:02 SunMarc

@pacman100

Mar 06 '24 05:03 zhangvia

@pacman100

Mar 19 '24 01:03 zhangvia

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Apr 12 '24 15:04 github-actions[bot]
