[DeepSpeed] Asking for feedback when training with zero2 with accelerate and diffusers
TL;DR: This is not really an issue report. It's more of a request for feedback on the changes I made to get accelerate and DeepSpeed to work together.
So, I have this training script: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py. The following patch reflects the changes I made to it to make it work with DeepSpeed:
With these changes, I am able to successfully save and resume from checkpoints when using the following accelerate config (in a multi-GPU single-node setup):
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
gradient_accumulation_steps: 1
offload_optimizer_device: none
offload_param_device: none
zero3_init_flag: false
zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
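For reference, the same setup can also be constructed programmatically instead of through the YAML file. This is a minimal sketch of the equivalent plugin using accelerate's public `DeepSpeedPlugin`/`Accelerator` API, not part of the patch itself:

```python
# Sketch: programmatic equivalent of the ZeRO stage-2 accelerate config above.
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,                     # zero_stage: 2
    gradient_accumulation_steps=1,    # gradient_accumulation_steps: 1
    offload_optimizer_device="none",  # offload_optimizer_device: none
    offload_param_device="none",      # offload_param_device: none
    zero3_init_flag=False,            # zero3_init_flag: false
)

accelerator = Accelerator(
    mixed_precision="fp16",           # mixed_precision: fp16
    deepspeed_plugin=deepspeed_plugin,
)
```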
Example training commands
Start an initial run:
CUDA_VISIBLE_DEVICES=2,3 accelerate launch --config_file ds2.yaml train_text_to_image.py --pretrained_model_name_or_path=$MODEL_NAME --dataset_name=$DATASET_NAME --mixed_precision="fp16" --resolution=512 --center_crop --random_flip --train_batch_size=1 --gradient_checkpointing --max_train_steps=4 --learning_rate=1e-05 --checkpointing_steps=1 --output_dir=/raid/.cache/huggingface/finetuning-ds
Then resume from a checkpoint:
CUDA_VISIBLE_DEVICES=2,3 accelerate launch --config_file ds2.yaml train_text_to_image.py --pretrained_model_name_or_path=$MODEL_NAME --dataset_name=$DATASET_NAME --mixed_precision="fp16" --resolution=512 --center_crop --random_flip --train_batch_size=1 --gradient_checkpointing --max_train_steps=6 --learning_rate=1e-05 --resume_from_checkpoint="latest" --checkpointing_steps=1 --output_dir=/raid/.cache/huggingface/finetuning-ds
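For context, `--checkpointing_steps` and `--resume_from_checkpoint="latest"` go through accelerate's state saving. The sketch below shows that pattern in isolation; the helper names are illustrative (the diffusers script inlines this logic), but `save_state`/`load_state` are the real accelerate calls:

```python
import os
from accelerate import Accelerator

def save_checkpoint(accelerator: Accelerator, output_dir: str, global_step: int) -> str:
    # Under DeepSpeed, save_state also writes the sharded ZeRO optimizer state.
    path = os.path.join(output_dir, f"checkpoint-{global_step}")
    accelerator.save_state(path)
    return path

def resume_latest_checkpoint(accelerator: Accelerator, output_dir: str) -> None:
    # --resume_from_checkpoint="latest": pick the newest checkpoint-<step> folder
    # and restore model, optimizer and LR scheduler state from it.
    dirs = [d for d in os.listdir(output_dir) if d.startswith("checkpoint")]
    latest = sorted(dirs, key=lambda d: int(d.split("-")[1]))[-1]
    accelerator.load_state(os.path.join(output_dir, latest))
```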
Python dependencies to run the script can be found here: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/requirements.txt.
accelerate-cli env
Copy-and-paste the text below in your GitHub issue
- `Accelerate` version: 0.30.1
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/sayak/.pyenv/versions/3.10.12/envs/diffusers/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 503.54 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
Not found
LMK if you need more information here.
Cc: @muellerzr @SunMarc
cc @pacman100
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Is there anyone looking into the DeepSpeed integration at the moment?
Sorry for the delay, there is a lot of backlog for deepspeed/FSDP issues since @pacman100 left but @muellerzr and I will be looking at these issues asap.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Not stale
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Not stale
Hello hello! I think I know enough to answer yes/no with mild confidence now 🫡 Ty for your patience 🙇
IIUC, the key areas of the diff are:
- During model saving, you unwrap manually before calling `save_pretrained`:
for i, model in enumerate(models):
- model.save_pretrained(os.path.join(output_dir, "unet"))
+ if isinstance(unwrap_model(model), type(unwrap_model(unet))):
+ unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))
- During loading, you first `unwrap` the model and then call `register_to_config`:
- load_model = UNet2DConditionModel.from_pretrained(input_dir, subfolder="unet")
- model.register_to_config(**load_model.config)
+ if isinstance(unwrap_model(model), type(unwrap_model(unet))):
+ load_model = UNet2DConditionModel.from_pretrained(input_dir, subfolder="unet")
+ model = unwrap_model(model)
+ model.register_to_config(**load_model.config)
- model.load_state_dict(load_model.state_dict())
- del load_model
+ model.load_state_dict(load_model.state_dict())
+ del load_model
Re: 1: If your `save_pretrained` doesn't do this unwrapping, then yes, this is indeed the proper way to do it. This is exactly how transformers is set up during `_save` (as it too doesn't unwrap):
if isinstance(self.accelerator.unwrap_model(self.model), supported_classes):
self.accelerator.unwrap_model(self.model).save_pretrained(
output_dir, state_dict=state_dict, safe_serialization=self.args.save_safetensors
)
Re: 2:
This is also correct, because we need access to the lowest-level model to register the config onto it.
If you'd like to test to be certain that the wrapped model has your modifications, you can look at model.module.module.xxx.mything (for as many .module levels as there are).
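A tiny helper for that check, as a sketch. It assumes each wrapper level (e.g. DeepSpeedEngine, DistributedDataParallel) exposes a `.module` attribute, and `wrapped_unet` is just a placeholder for the prepared model:

```python
def unwrap_fully(model):
    # Walk down through however many wrapper levels expose `.module`
    # (DeepSpeedEngine, DistributedDataParallel, ...) to the innermost model.
    while hasattr(model, "module"):
        model = model.module
    return model

# e.g. unwrap_fully(wrapped_unet).config  # inspect attributes on the innermost model
```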
Ah alright. Thanks for the confirmation!
This thread will be helpful for folks who use diffusers to train with DeepSpeed, hopefully. Cc @bghira as well. I know you have workarounds but just wanted to make you aware.
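For anyone adapting this to another diffusers script, the two hunks quoted above fold into accelerate's save/load state hooks roughly as follows. This is a sketch rather than the exact patch: `unwrap_model`, `unet`, and `accelerator` come from the training script, and the hook registration mirrors the pattern the diffusers examples use.

```python
import os
from diffusers import UNet2DConditionModel

def save_model_hook(models, weights, output_dir):
    if accelerator.is_main_process:
        for model in models:
            if isinstance(unwrap_model(model), type(unwrap_model(unet))):
                unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))
            # pop the weight so accelerate does not serialize it a second time
            if weights:
                weights.pop()

def load_model_hook(models, input_dir):
    while models:
        model = models.pop()
        if isinstance(unwrap_model(model), type(unwrap_model(unet))):
            load_model = UNet2DConditionModel.from_pretrained(input_dir, subfolder="unet")
            model = unwrap_model(model)
            model.register_to_config(**load_model.config)
            model.load_state_dict(load_model.state_dict())
            del load_model

accelerator.register_save_state_pre_hook(save_model_hook)
accelerator.register_load_state_pre_hook(load_model_hook)
```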
@sayakpaul this does not work for me. When I run my Flux Dreambooth training with this config, I get this error:
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1159, in launch_command
deepspeed_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 813, in deepspeed_launcher
raise ImportError("DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source.")
ImportError: DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source.
This is fine, sure. But what's not fine is this error when I try to install deepspeed with `pip install deepspeed`:
Collecting deepspeed
Downloading deepspeed-0.15.1.tar.gz (1.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 46.6 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [8 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-83bb876q/deepspeed_0e9decbacdab4f48bb7c273e00dd20cd/setup.py", line 108, in <module>
cuda_major_ver, cuda_minor_ver = installed_cuda_version()
File "/tmp/pip-install-83bb876q/deepspeed_0e9decbacdab4f48bb7c273e00dd20cd/op_builder/builder.py", line 51, in installed_cuda_version
raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Any idea how to fix it?
I have no idea what my CUDA_HOME env var should be.
@sayakpaul installing nvcc with `apt install nvidia-cuda-toolkit` helped with installing deepspeed.
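For anyone hitting the same error: DeepSpeed's op builder appears to resolve the toolkit path via torch's cpp_extension helper, so you can check what it will see before retrying the install. A quick diagnostic sketch:

```python
# Prints the CUDA toolkit path torch resolves (from $CUDA_HOME or the nvcc location).
# If this is None, `pip install deepspeed` fails with the MissingCUDAException above.
from torch.utils import cpp_extension

print("CUDA_HOME:", cpp_extension.CUDA_HOME)
```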
this is a container-level issue, really. you can try SimpleTuner, which has thorough installation instructions and tracks these issues more closely. it is loosely affiliated with diffusers, though it is not an HF project.
@bghira thanks, but I think I'd rather wait for someone to fix the HF trainer than figure out how yet another (hopefully working) library works.
the huggingface example training scripts are "thin" and do not contain all the necessary bits of logic to actually use DeepSpeed; they are not intended to be a complete example for e.g. multi-GPU or ZeRO training.
@bghira but SimpleTuner also does not have a complete guide on how to do multi-GPU Flux Dreambooth training.
did ya look? https://github.com/bghira/SimpleTuner/blob/main/documentation/DEEPSPEED.md
@bghira there is no mention of "flux" or even "dreambooth" in that article.
@bghira sorry, I might be dumb, but I was unable to find how to launch the training, only how to configure it.
@bghira do you have a Docker container that has everything ready for Dreambooth Flux training, including a complete guide on how to start it?