
[DeepSpeed] Asking for feedback when training with zero2 with accelerate and diffusers

Open sayakpaul opened this issue 1 year ago • 5 comments

TL;DR: This is not really an issue report. It's more of a request for feedback on the changes I made to make accelerate and deepspeed work together.

So, I have this training script: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py. The following patch reflects the changes I made to it to make it work with DeepSpeed:

ds.patch

With these changes, I am able to successfully save and resume from checkpoints when using the following accelerate config (in a multi-GPU single-node setup):

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
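(For reference, the DeepSpeed portion of this config corresponds roughly to the following programmatic setup. This is a sketch only; the YAML route above is what I actually tested.)

from accelerate import Accelerator, DeepSpeedPlugin

# Rough programmatic equivalent of the deepspeed_config block above (sketch):
ds_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=1,
    offload_optimizer_device="none",
    offload_param_device="none",
)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=ds_plugin)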

Example training commands

Start an initial run:

CUDA_VISIBLE_DEVICES=2,3 accelerate launch --config_file ds2.yaml train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --mixed_precision="fp16" \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --max_train_steps=4 \
  --learning_rate=1e-05 \
  --checkpointing_steps=1 \
  --output_dir=/raid/.cache/huggingface/finetuning-ds

Then resume from a checkpoint:

CUDA_VISIBLE_DEVICES=2,3 accelerate launch --config_file ds2.yaml train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --mixed_precision="fp16" \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --max_train_steps=6 \
  --learning_rate=1e-05 \
  --resume_from_checkpoint="latest" \
  --checkpointing_steps=1 \
  --output_dir=/raid/.cache/huggingface/finetuning-ds

Python dependencies to run the script can be found here: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/requirements.txt.

accelerate-cli env

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.30.1
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/sayak/.pyenv/versions/3.10.12/envs/diffusers/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 503.54 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
        Not found

LMK if you need more information here.

sayakpaul avatar May 16 '24 12:05 sayakpaul

Cc: @muellerzr @SunMarc

sayakpaul avatar May 16 '24 12:05 sayakpaul

cc @pacman100

SunMarc avatar May 16 '24 16:05 SunMarc

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 15 '24 15:06 github-actions[bot]

Is there anyone looking into the DeepSpeed integration at the moment?

sayakpaul avatar Jun 24 '24 15:06 sayakpaul

Sorry for the delay, there is a lot of backlog for deepspeed/FSDP issues since @pacman100 left but @muellerzr and I will be looking at these issues asap.

SunMarc avatar Jun 24 '24 16:06 SunMarc

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 19 '24 15:07 github-actions[bot]

Not stale

sayakpaul avatar Jul 19 '24 15:07 sayakpaul

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 13 '24 15:08 github-actions[bot]

Not stale

sayakpaul avatar Aug 13 '24 15:08 sayakpaul

Hello hello! I think I know enough to answer yes/no with mild confidence now 🫡 Ty for your patience 🙇

IIUC, the key areas of the diff are:

  1. During model saving, you unwrap manually before calling .save_pretrained:
                 for i, model in enumerate(models):
-                    model.save_pretrained(os.path.join(output_dir, "unet"))
+                    if isinstance(unwrap_model(model), type(unwrap_model(unet))):
+                        unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))
  2. During loading, you first unwrap the model and then call register_to_config:
-                load_model = UNet2DConditionModel.from_pretrained(input_dir, subfolder="unet")
-                model.register_to_config(**load_model.config)
+                if isinstance(unwrap_model(model), type(unwrap_model(unet))):
+                    load_model = UNet2DConditionModel.from_pretrained(input_dir, subfolder="unet")
+                    model = unwrap_model(model)
+                    model.register_to_config(**load_model.config)
 
-                model.load_state_dict(load_model.state_dict())
-                del load_model
+                    model.load_state_dict(load_model.state_dict())
+                    del load_model

Re: 1: If your save_pretrained doesn't do this unwrapping itself, then yes, this is indeed the proper way to do it. This is exactly how transformers is set up during _save (as it too doesn't unwrap):

            if isinstance(self.accelerator.unwrap_model(self.model), supported_classes):
                self.accelerator.unwrap_model(self.model).save_pretrained(
                    output_dir, state_dict=state_dict, safe_serialization=self.args.save_safetensors
                )

Re: 2:

This is also correct, because we need access to the lowest-level model to load the config and weights onto.

If you'd like to be certain that the wrapped model actually carries your modifications, you can look at model.module.module.xxx.mything (for however many .module wrappers there are).
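For example, a tiny helper along these lines (the name is illustrative, not an accelerate API; accelerate's own unwrapping works similarly):

import torch.nn as nn

def peel(model: nn.Module) -> nn.Module:
    # DDP and the DeepSpeed engine each wrap the model in a .module layer;
    # keep descending until we reach the bare nn.Module.
    while hasattr(model, "module"):
        model = model.module
    return model

# e.g. assert peel(wrapped_model).my_attribute == expected_value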

muellerzr avatar Aug 14 '24 19:08 muellerzr

Ah alright. Thanks for the confirmation!

This thread will be helpful for folks who use diffusers to train with DeepSpeed, hopefully. Cc @bghira as well. I know you have workarounds but just wanted to make you aware.
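For anyone who wants to replicate this without digging through the patch, the confirmed changes boil down to roughly the sketch below (unet, unwrap_model, and accelerator are defined earlier in the diffusers training script):

import os
from diffusers import UNet2DConditionModel

def save_model_hook(models, weights, output_dir):
    if accelerator.is_main_process:
        for model in models:
            # under DeepSpeed the model arrives wrapped, so unwrap before saving
            if isinstance(unwrap_model(model), type(unwrap_model(unet))):
                unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))
            if weights:  # pop so accelerate doesn't serialize the same weights again
                weights.pop()

def load_model_hook(models, input_dir):
    for _ in range(len(models)):
        model = models.pop()
        if isinstance(unwrap_model(model), type(unwrap_model(unet))):
            load_model = UNet2DConditionModel.from_pretrained(input_dir, subfolder="unet")
            model = unwrap_model(model)
            model.register_to_config(**load_model.config)
            model.load_state_dict(load_model.state_dict())
            del load_model

accelerator.register_save_state_pre_hook(save_model_hook)
accelerator.register_load_state_pre_hook(load_model_hook)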

sayakpaul avatar Aug 15 '24 01:08 sayakpaul

@sayakpaul this does not work for me. When I run my Flux Dreambooth training with this config, I get this error:

Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1159, in launch_command
    deepspeed_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 813, in deepspeed_launcher
    raise ImportError("DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source.")
ImportError: DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source.

This is fine, sure. But what's not fine is the error below when I try to install deepspeed with pip install deepspeed:

Collecting deepspeed
  Downloading deepspeed-0.15.1.tar.gz (1.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 46.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [8 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-83bb876q/deepspeed_0e9decbacdab4f48bb7c273e00dd20cd/setup.py", line 108, in <module>
          cuda_major_ver, cuda_minor_ver = installed_cuda_version()
        File "/tmp/pip-install-83bb876q/deepspeed_0e9decbacdab4f48bb7c273e00dd20cd/op_builder/builder.py", line 51, in installed_cuda_version
          raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
      op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Any idea how to fix it?

kopyl avatar Sep 25 '24 15:09 kopyl

I have no idea what my CUDA_HOME env var should be.

kopyl avatar Sep 25 '24 15:09 kopyl

@sayakpaul installing nvcc with apt install nvidia-cuda-toolkit fixed the deepspeed install.
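For anyone else hitting the MissingCUDAException above, the sequence was roughly this (the CUDA_HOME path is an assumption; point it at wherever the toolkit lives on your image):

apt install nvidia-cuda-toolkit   # provides nvcc, which DeepSpeed's setup.py probes for
export CUDA_HOME=/usr/local/cuda  # only needed if the toolkit isn't in a default location
pip install deepspeed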

kopyl avatar Sep 25 '24 15:09 kopyl

this is a container-level issue really. you can try simpletuner, which has thorough installation instructions and tracks these issues more closely. it is loosely affiliated with diffusers, though it is not an HF project.

bghira avatar Sep 25 '24 15:09 bghira

@bghira thanks, but I think I'd rather wait for someone to fix the HF trainer than figure out how yet another (hopefully working) library works.

kopyl avatar Sep 25 '24 15:09 kopyl

the huggingface example training scripts are "thin": they do not contain all the necessary bits of logic to actually use DeepSpeed, and are not intended to be a complete example for e.g. multi-GPU or ZeRO training.

bghira avatar Sep 25 '24 15:09 bghira

@bghira but simpletuner also does not have a complete guide on how to do multi-GPU Flux Dreambooth training.

kopyl avatar Sep 25 '24 15:09 kopyl

did ya look? https://github.com/bghira/SimpleTuner/blob/main/documentation/DEEPSPEED.md

bghira avatar Sep 25 '24 15:09 bghira

@bghira there is no mention of "flux" or even "dreambooth" in that article.

kopyl avatar Sep 25 '24 15:09 kopyl

@bghira sorry, I might be dumb, but I was unable to find how to launch the training, only how to configure it.

kopyl avatar Sep 25 '24 15:09 kopyl

@bghira do you have a Docker container that has everything ready for Dreambooth Flux training, including a complete guide on how to start it?

kopyl avatar Sep 25 '24 15:09 kopyl