diffusers
Training DreamBooth on multiple GPUs with DeepSpeed
Describe the bug
Hi,
I am distributing DreamBooth training across 8 Tesla V100 (16GB) GPUs using DeepSpeed.
I have configured accelerate as below (output of default_config.yaml):
```yaml
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
```
I am running the train_dreambooth.py script as below:
```bash
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="/home/ubuntu/shoes" \
  --output_dir="/home/ubuntu/sd_dreamboth" \
  --instance_prompt="[v] shoes" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=600
```
I can observe GPU memory being consumed right after executing the above script, but training stops immediately and I get the following error:
```
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/torch/utils/cpp_extension.py", line 1818, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

RuntimeError: Error building extension 'utils'

ImportError: /home/ubuntu/.cache/torch_extensions/py38_cu118/utils/utils.so: cannot open shared object file: No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 110404) of binary: /usr/bin/python3
```
Could you please let me know whether I am configuring accelerate correctly?
Reproduction
```bash
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="/home/ubuntu/shoes" \
  --output_dir="/home/ubuntu/sd_dreamboth" \
  --instance_prompt="[v] shoes" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=600
```
Logs
No response
System Info
https://cloud.lambdalabs.com/instances: gpu_8x_v100
> ImportError: /home/ubuntu/.cache/torch_extensions/py38_cu118/utils/utils.so: cannot open shared object file: No such file or directory
> ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 110404) of binary: /usr/bin/python3

This looks to me like PyTorch and DeepSpeed are not correctly installed.
Can you run dreambooth without deepspeed to begin with?
Yes, dreambooth is running without deepspeed.
Ok so it's definitely related to DeepSpeed.
Sorry, I currently don't have the set-up and time to dig deeper into this. cc'ing @williamberman @patil-suraj @pcuenca in case they find some time :-)
From the stack trace, it looks like deepspeed is not installed correctly. Could you try re-installing it? Also, note that deepspeed support is experimental; I would recommend using xformers for memory-efficient training.
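A rough sketch of what the re-install could look like, assuming the stale JIT-compiled extension cache from the failed build is part of the problem (these commands are illustrative, not a verified recipe):

```bash
# Remove the JIT-compiled extension cache left behind by the failed build
rm -rf ~/.cache/torch_extensions

# Reinstall DeepSpeed so its ops are rebuilt against the current PyTorch/CUDA
pip install --force-reinstall --no-cache-dir deepspeed

# DeepSpeed's environment report lists which ops are compatible/installed
ds_report
```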
Thank you for the recommendation, I will try using xformers. Previously, I was using deepspeed for model parallelism. Can you provide some guidance on how to accomplish this task?
Hey @hamzafar! Did some digging on the deepspeed side of things. What's happening here is that the deepspeed kernels are JIT-compiled PyTorch C++ extensions, and one (or more) of those JIT compilations is failing. This is why your deepspeed install initially succeeds but fails at runtime in the dreambooth training script. Usually when a native extension compilation fails in Python (Node too!), you'll find additional logs from the compilation subprocess if you scroll up a bit further in your console. I'd recommend investigating there and opening an issue with deepspeed.
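One way to surface those compiler logs is to pre-build the deepspeed ops at install time rather than letting them JIT-compile at runtime. A sketch using DeepSpeed's DS_BUILD_OPS pre-build flag (worth double-checking the exact flags against the DeepSpeed install docs for your version):

```bash
# Pre-compile DeepSpeed's C++/CUDA ops during pip install so any compiler
# error shows up in the install log instead of at training time
DS_BUILD_OPS=1 pip install --force-reinstall --no-cache-dir deepspeed 2>&1 | tee ds_build.log
```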
Not sure if you're asking for assistance on model parallelism or with using xformers so will take a stab at both :)
For xformers, we provide install instructions here: https://huggingface.co/docs/diffusers/optimization/xformers, and the dreambooth training script has the --enable_xformers_memory_efficient_attention flag for enabling it.
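Roughly, your original launch command with that flag added would look like this (assuming xformers is installed per the docs above):

```bash
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="/home/ubuntu/shoes" \
  --output_dir="/home/ubuntu/sd_dreamboth" \
  --instance_prompt="[v] shoes" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=600 \
  --enable_xformers_memory_efficient_attention
```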
Re model parallelism w/out deepspeed: I'm probably not the best person on the team to answer. Accelerate does document FSDP here, but I'm not sure whether it would work with the dreambooth training script or whether it would be the suggested parallelization strategy (cc @patil-suraj here).
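If you do want to try FSDP through Accelerate, the relevant part of default_config.yaml would look roughly like the sketch below. The key names follow Accelerate's FSDP docs, but the values are illustrative assumptions and I haven't verified this combination with train_dreambooth.py:

```yaml
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_min_num_params: 100000000
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1  # 1 = FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
```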
Hi @williamberman, thank you for the recommendation. I will check the logs and open an issue with deepspeed.
Actually, I was asking for assistance with model parallelism.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.