Memory bug when using accelerate with DeepSpeed to train diffusion models
System Info
accelerate: 0.22.0
python: 3.8.18
config yaml:
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero3_save_16bit_model: false
  overlap_comm: true
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
I use the example training script from the diffusers repo to fine-tune Stable Diffusion. My training command is:
accelerate launch --config_file ./deepspeed.yaml --mixed_precision="fp16" train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$dataset_name \
--resolution=256 --center_crop --random_flip \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--enable_xformers_memory_efficient_attention \
--output_dir="sd-pokemon-model"
When I train with DeepSpeed ZeRO stage 2, the process uses about 7 GB of VRAM per GPU. However, with stage 3 it uses about 9 GB of VRAM per GPU. Is this a bug in accelerate or in DeepSpeed? Theoretically, stage 3 should not use more VRAM than stage 2, since it shards the model parameters in addition to the gradients and optimizer states.
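To sanity-check that expectation, DeepSpeed ships memory estimators for the model states (parameters, gradients and optimizer states; activations are not counted). A minimal sketch for the UNet being fine-tuned, assuming the runwayml/stable-diffusion-v1-5 checkpoint and the 6 processes from my config:

```python
# Sketch: estimate per-GPU model-state memory under ZeRO stage 2 vs. stage 3.
# These helpers only count params/grads/optimizer states, not activations.
from deepspeed.runtime.zero.stage_1_and_2 import (
    estimate_zero2_model_states_mem_needs_all_live,
)
from deepspeed.runtime.zero.stage3 import (
    estimate_zero3_model_states_mem_needs_all_live,
)
from diffusers import UNet2DConditionModel

# The ~860M-parameter SD v1-5 UNet; the checkpoint name is an assumption here.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Stage 2 shards gradients and optimizer states across the 6 processes.
estimate_zero2_model_states_mem_needs_all_live(unet, num_gpus_per_node=6, num_nodes=1)
# Stage 3 additionally shards the parameters, so its estimate should be lower.
estimate_zero3_model_states_mem_needs_all_live(unet, num_gpus_per_node=6, num_nodes=1)
```

Both calls print a table of estimated per-GPU memory; the stage-3 figure should come out below the stage-2 one, which is why the observed 9 GB vs. 7 GB looks wrong to me.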
Expected behavior
How can I use stage 3 to reduce memory consumption?
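In the YAML above I would expect switching zero_stage: 2 to zero_stage: 3 (plus zero3_init_flag: true) to be enough. For completeness, a sketch of the programmatic equivalent via accelerate's DeepSpeedPlugin; the CPU-offload arguments are my own addition for squeezing VRAM further, not something from the run above:

```python
# Sketch: enable ZeRO stage 3 through accelerate's DeepSpeedPlugin instead of
# the YAML config. Offloading trades step time for GPU memory.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,                    # shard params as well as grads/optimizer
    gradient_accumulation_steps=4,   # matches the launch command above
    zero3_init_flag=True,            # initialize the model directly sharded
    offload_optimizer_device="cpu",  # optional: optimizer states off-GPU
    offload_param_device="cpu",      # optional: sharded params off-GPU
)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)
```

Offloading only helps when GPU memory is the actual bottleneck, since every offloaded tensor has to travel over PCIe on each step.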
cc @pacman100
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.