Memory bug when using accelerate with deepspeed to train diffusion models

Open zhangvia opened this issue 4 months ago • 3 comments

System Info

accelerate: 0.22.0
python:3.8.18
config yaml:
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: False
  zero3_save_16bit_model: False
  overlap_comm: True
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I use the example training code in the diffusers repo to fine-tune Stable Diffusion. My training command is:

accelerate launch --config_file ./deepspeed.yaml --mixed_precision="fp16" train_text_to_image.py \
	--pretrained_model_name_or_path=$MODEL_NAME \
	--dataset_name=$dataset_name \
	--resolution=256 --center_crop --random_flip \
	--train_batch_size=1 \
	--gradient_accumulation_steps=4 \
	--max_train_steps=15000 \
	--learning_rate=1e-05 \
	--max_grad_norm=1 \
	--lr_scheduler="constant" --lr_warmup_steps=0 \
	--enable_xformers_memory_efficient_attention \
	--output_dir="sd-pokemon-model"

When I train the model with DeepSpeed stage 2, it uses about 7GB of VRAM per GPU. With stage 3, however, the same run uses about 9GB of VRAM per GPU. Is that a bug in accelerate or in deepspeed? In theory, stage 3 should not use more VRAM than stage 2.
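
To make the theoretical expectation concrete, here is a rough sketch of the per-GPU model-state accounting from the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 Adam states). The parameter count is an assumption (roughly 860M for the Stable Diffusion v1 UNet, assuming only the UNet is trained), and `model_state_bytes` is a hypothetical helper, not part of accelerate or deepspeed. It counts model states only, so activations, communication buffers, and stage 3 prefetch buffers are left out.

```python
def model_state_bytes(num_params, num_gpus, zero_stage):
    """Per-GPU bytes for fp16 weights, fp16 gradients and fp32 Adam states."""
    weights = 2 * num_params   # fp16 parameters
    grads = 2 * num_params     # fp16 gradients
    optim = 12 * num_params    # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:
        optim /= num_gpus      # stage 1 partitions optimizer states
    if zero_stage >= 2:
        grads /= num_gpus      # stage 2 also partitions gradients
    if zero_stage >= 3:
        weights /= num_gpus    # stage 3 also partitions parameters
    return weights + grads + optim


UNET_PARAMS = 860_000_000  # assumption: approximate size of the trained UNet
NUM_GPUS = 6               # num_processes from the config above

for stage in (2, 3):
    gib = model_state_bytes(UNET_PARAMS, NUM_GPUS, stage) / 2**30
    print(f"ZeRO stage {stage}: {gib:.2f} GiB of model states per GPU")
```

Under these assumptions the stage 3 figure comes out lower than the stage 2 figure, so the extra 2GB observed would have to come from something outside this accounting.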

Expected behavior

Stage 3 should use no more VRAM than stage 2. How can I configure stage 3 so that it actually reduces memory consumption?
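
One thing that could be tried is shrinking the stage 3 working buffers through an explicit DeepSpeed config file. The sketch below only generates such a file. The keys under `zero_optimization` are documented DeepSpeed options, but the numeric values and the helper name `make_zero3_config` are illustrative assumptions, and whether they bring the 9GB figure down has not been tested here.

```python
import json


def make_zero3_config(bucket_elems=50_000_000):
    """Build a ZeRO stage 3 config dict with smaller working buffers.

    All numeric values are illustrative guesses, not tuned settings.
    """
    return {
        # "auto" lets accelerate fill these in from the launch arguments.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "fp16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "contiguous_gradients": True,
            # Sizes below are counted in elements, not bytes.
            "reduce_bucket_size": bucket_elems,
            "stage3_prefetch_bucket_size": bucket_elems,
            "stage3_param_persistence_threshold": 10_000,
            "stage3_max_live_parameters": 100_000_000,
            "stage3_max_reuse_distance": 100_000_000,
        },
    }


config = make_zero3_config()
print(json.dumps(config, indent=2))
```

The printed JSON could be saved to a file and referenced from the accelerate YAML via `deepspeed_config_file` under `deepspeed_config`, most likely in place of the inline `zero_stage` and offload fields rather than alongside them.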

zhangvia avatar Feb 21 '24 07:02 zhangvia

cc @pacman100

SunMarc avatar Feb 23 '24 17:02 SunMarc

@pacman100

zhangvia avatar Mar 06 '24 05:03 zhangvia

@pacman100

zhangvia avatar Mar 19 '24 01:03 zhangvia

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 12 '24 15:04 github-actions[bot]