diffusers
Training DreamBooth on multiple GPUs with DeepSpeed
Describe the bug
Hi,
I am distributing DreamBooth training across 8 Tesla V100 (16GB) GPUs using DeepSpeed.
I have configured accelerate as below (output of default_config.yaml):
```yaml
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
```
I am running the train_dreambooth.py script as below:
```bash
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="/home/ubuntu/shoes" \
  --output_dir="/home/ubuntu/sd_dreamboth" \
  --instance_prompt="[v] shoes" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=600
```
I can observe GPU memory being consumed right after executing the above script, but training stops immediately and I get the following error:
```
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/torch/utils/cpp_extension.py", line 1818, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

RuntimeError: Error building extension 'utils'

ImportError: /home/ubuntu/.cache/torch_extensions/py38_cu118/utils/utils.so: cannot open shared object file: No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 110404) of binary: /usr/bin/python3
```
Could you please let me know whether I am configuring accelerate correctly?
Reproduction
```bash
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="/home/ubuntu/shoes" \
  --output_dir="/home/ubuntu/sd_dreamboth" \
  --instance_prompt="[v] shoes" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=600
```
Logs
No response
System Info
https://cloud.lambdalabs.com/instances: gpu_8x_v100
> ImportError: /home/ubuntu/.cache/torch_extensions/py38_cu118/utils/utils.so: cannot open shared object file: No such file or directory
> ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 110404) of binary: /usr/bin/python3

This looks to me like PyTorch and DeepSpeed are not correctly installed.
Can you run dreambooth without deepspeed to begin with?
Yes, dreambooth is running without deepspeed.
Ok so it's definitely related to DeepSpeed.
Sorry, I currently don't have the set-up and time to dig deeper into this. cc'ing @williamberman @patil-suraj @pcuenca in case they find some time :-)
From the stack trace, it looks like deepspeed is not installed correctly. Could you try re-installing it? Also, note that deepspeed support is experimental; I would recommend using xformers for memory-efficient training.
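A rough sketch of what the re-install could look like, assuming the stale JIT-compiled extension cache from the failed build is part of the problem (these commands are illustrative, not a verified recipe):

```bash
# Remove the JIT-compiled extension cache left behind by the failed build
rm -rf ~/.cache/torch_extensions

# Reinstall DeepSpeed so its ops are rebuilt against the current PyTorch/CUDA
pip install --force-reinstall --no-cache-dir deepspeed

# DeepSpeed's environment report lists which ops are compatible/installed
ds_report
```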
Thank you for the recommendation, I will try using xformers. Previously, I was using deepspeed for model parallelism. Can you provide some guidance on how to accomplish this task?
Hey @hamzafar! Did some digging on the deepspeed side of things. What's happening here is that the deepspeed kernels are JIT-compiled PyTorch C++ extensions, and one (or more) of those JIT compilations is failing. This is why your deepspeed install initially succeeds but fails at runtime in the dreambooth training script. Usually when a native extension compilation fails in Python (Node too!), you'll find additional logs from the compilation subprocess if you scroll up a bit further in your console. I'd recommend investigating there and opening an issue with deepspeed.
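One way to surface those compiler logs is to pre-build the deepspeed ops at install time rather than letting them JIT-compile at runtime. A sketch using DeepSpeed's DS_BUILD_OPS pre-build flag (worth double-checking the exact flags against the DeepSpeed install docs for your version):

```bash
# Pre-compile DeepSpeed's C++/CUDA ops during pip install so any compiler
# error shows up in the install log instead of at training time
DS_BUILD_OPS=1 pip install --force-reinstall --no-cache-dir deepspeed 2>&1 | tee ds_build.log
```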
Not sure if you're asking for assistance on model parallelism or with using xformers so will take a stab at both :)
For xformers, we provide install instructions here: https://huggingface.co/docs/diffusers/optimization/xformers, and the dreambooth training script has the --enable_xformers_memory_efficient_attention flag for enabling it.
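Roughly, your original launch command with that flag added would look like this (assuming xformers is installed per the docs above):

```bash
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="/home/ubuntu/shoes" \
  --output_dir="/home/ubuntu/sd_dreamboth" \
  --instance_prompt="[v] shoes" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=600 \
  --enable_xformers_memory_efficient_attention
```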
Re model parallelism w/out deepspeed: I'm probably not the best person on the team to answer. Accelerate does document FSDP here, but I'm not sure whether it would work with the dreambooth training script or whether it would be the suggested parallelization strategy (cc @patil-suraj here).
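If you do want to try FSDP through Accelerate, the relevant part of default_config.yaml would look roughly like the sketch below. The key names follow Accelerate's FSDP docs, but the values are illustrative assumptions and I haven't verified this combination with train_dreambooth.py:

```yaml
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_min_num_params: 100000000
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1  # 1 = FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
```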
Hi @williamberman, thank you for the recommendation. I will check the logs and open an issue with deepspeed.
Actually, I was asking for assistance with model parallelism.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.