
Multi-GPU training of Flux reports an error

Open chongxian opened this issue 1 year ago • 20 comments

I use the settings below to train a Flux LoRA:

accelerate launch  --gpu_ids 0,1 --main_process_port 29502 --mixed_precision bf16 --num_cpu_threads_per_process=2 \
    flux_train_network.py --pretrained_model_name_or_path ${flux_model_path} \
    --clip_l ${clip_l_path} --t5xxl ${t5xxl_path}  --ae ${ae_path} \
    --cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers \
    --max_data_loader_n_workers 2 --seed 42 \
    --gradient_checkpointing \
    --save_precision bf16 --mixed_precision bf16 \
    --network_module networks.lora_flux \
    --network_dim 16 \
    --optimizer_type prodigy \
    --learning_rate 1 --network_train_unet_only \
    --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk \
    --highvram \
    --max_train_epochs 10   \
    --save_every_n_epochs 1 \
    --train_data_dir=${input_path} \
    --output_dir ${output_path}  \
    --output_name flux_shot \
    --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1 --loss_type l2 \
    --resolution="1024,1024" --bucket_reso_steps=64 --bucket_no_upscale --min_bucket_reso=256 --max_bucket_reso=2048 --enable_bucket \
    --caption_extension=".txt" \
    --lr_scheduler="cosine" --lr_warmup_steps=396 --train_batch_size=4 --deepspeed --zero_stage=2 \
    --log_with="wandb" --wandb_run_name="shot2" --wandb_api_key="" --logging_dir=${output_path}"/logs" --log_tracker_name="flux_lora1" 

It reports an error like this: (screenshot attached)

chongxian avatar Aug 19 '24 07:08 chongxian

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

kohya-ss avatar Aug 22 '24 03:08 kohya-ss

The error was not caused by DDP multi-GPU training, but by DeepSpeed... Plain DDP multi-GPU training was fine for Flux LoRA training, but as soon as you installed DeepSpeed and enabled it in your training script by adding --deepspeed --zero_stage=2, it would throw "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision set to bf16 for both accelerate and the scripts, the error persisted. Maybe @BootsofLagrangian would like to take a look. Thanks.

terrificdm avatar Aug 24 '24 16:08 terrificdm
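
For reference, the mismatch described above can be reproduced outside sd-scripts with a few lines of PyTorch: a float32 tensor fed into a bf16 linear layer raises the same kind of RuntimeError unless the matmul runs under torch.autocast. This is only a minimal sketch of the failure mode, not the actual Flux code path.

```python
# Minimal sketch of the dtype mismatch (not sd-scripts code): bf16 weights, fp32 input.
# The exact error message varies slightly between CPU and CUDA.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
linear = torch.nn.Linear(8, 8).to(device=device, dtype=torch.bfloat16)
x = torch.randn(2, 8, device=device)  # e.g. a cached latent/embedding still in float32

try:
    linear(x)  # raises a dtype-mismatch RuntimeError such as "mat1 and mat2 must have the same dtype"
except RuntimeError as e:
    print(e)

# Under autocast, eligible ops cast their inputs to bf16, so the same call succeeds.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = linear(x)
print(out.dtype)  # torch.bfloat16
```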

> The error was not caused by DDP multi-GPU training, but by DeepSpeed... Plain DDP multi-GPU training was fine for Flux LoRA training, but as soon as you installed DeepSpeed and enabled it in your training script by adding --deepspeed --zero_stage=2, it would throw "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision set to bf16 for both accelerate and the scripts, the error persisted. Maybe @BootsofLagrangian would like to take a look. Thanks.

Thank you for noticing. I'll check it out and add a comment.

BootsofLagrangian avatar Aug 25 '24 03:08 BootsofLagrangian

> The error was not caused by DDP multi-GPU training, but by DeepSpeed... Plain DDP multi-GPU training was fine for Flux LoRA training, but as soon as you installed DeepSpeed and enabled it in your training script by adding --deepspeed --zero_stage=2, it would throw "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision set to bf16 for both accelerate and the scripts, the error persisted. Maybe @BootsofLagrangian would like to take a look. Thanks.

> Thank you for noticing. I'll check it out and add a comment.

Do you have any idea how to solve this problem?

chongxian avatar Aug 29 '24 06:08 chongxian

> Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100 40GB GPUs. Is it feasible to train the Flux model with multiple graphics cards? I've been running into OOM, and when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16".

chongxian avatar Aug 29 '24 06:08 chongxian

> Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

> I have four A100 40GB GPUs. Is it feasible to train the Flux model with multiple graphics cards? I've been running into OOM, and when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16".

This is my complete command:

accelerate launch  --gpu_ids 0,1,2 --mixed_precision bf16 --num_cpu_threads_per_process 3 flux_train.py \
    --pretrained_model_name_or_path ${flux_model_path} --clip_l ${clip_l_path} --t5xxl ${t5xxl_path} --ae ${ae_path} --save_model_as safetensors \
    --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
    --output_dir ${output_path} --output_name flux_dev  --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1  \
    --learning_rate 5e-5 --max_train_epochs 10 \
    --optimizer_type adamw8bit  \
    --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 \
    --cpu_offload_checkpointing \
    --resolution="1024,1024" --bucket_reso_steps=64 --bucket_no_upscale --min_bucket_reso=256 --max_bucket_reso=2048 --enable_bucket \
    --train_data_dir=${input_path} --caption_extension=".txt"  \
    --deepspeed --zero_stage=2 --full_bf16  --gradient_accumulation_steps=1 --cache_latents --offload_optimizer_device="cpu"

chongxian avatar Aug 29 '24 06:08 chongxian

> Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

> I have four A100 40GB GPUs. Is it feasible to train the Flux model with multiple graphics cards? I've been running into OOM, and when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16".

In sd-scripts, autocast is enabled. But for a reason I don't yet understand, autocast does not work for the DeepSpeed-wrapped model. I think fixing it will take time. However, another ZeRO implementation, FSDP, which is not implemented in sd-scripts, does work.

BootsofLagrangian avatar Aug 29 '24 07:08 BootsofLagrangian
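
If autocast really isn't being applied to the DeepSpeed-wrapped model, one possible workaround is to cast the cached float32 inputs to the model's compute dtype before the forward pass. This is only a sketch under that assumption, not the actual sd-scripts fix; the names and batch layout below are illustrative.

```python
# Hedged workaround sketch: cast every floating-point tensor in the batch
# (cached latents, text-encoder outputs, ...) to the model's compute dtype.
import torch

def cast_batch_to(batch: dict, dtype: torch.dtype) -> dict:
    """Return a copy of the batch with all floating-point tensors cast to `dtype`."""
    out = {}
    for key, value in batch.items():
        if torch.is_tensor(value) and value.is_floating_point():
            out[key] = value.to(dtype)
        else:
            out[key] = value
    return out

# Usage inside the training loop (illustrative names, not sd-scripts APIs):
# weight_dtype = torch.bfloat16
# batch = cast_batch_to(batch, weight_dtype)
# model_pred = flux_model(**batch)
```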

> Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

> I have four A100 40GB GPUs. Is it feasible to train the Flux model with multiple graphics cards? I've been running into OOM, and when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16".

> In sd-scripts, autocast is enabled. But for a reason I don't yet understand, autocast does not work for the DeepSpeed-wrapped model. I think fixing it will take time. However, another ZeRO implementation, FSDP, which is not implemented in sd-scripts, does work.

How should I modify my command to run flux_train.py?

chongxian avatar Aug 29 '24 07:08 chongxian

> Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

> I have four A100 40GB GPUs. Is it feasible to train the Flux model with multiple graphics cards? I've been running into OOM, and when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16".

> This is my complete command:

> accelerate launch  --gpu_ids 0,1,2 --mixed_precision bf16 --num_cpu_threads_per_process 3 flux_train.py \
>     --pretrained_model_name_or_path ${flux_model_path} --clip_l ${clip_l_path} --t5xxl ${t5xxl_path} --ae ${ae_path} --save_model_as safetensors \
>     --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
>     --output_dir ${output_path} --output_name flux_dev  --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1  \
>     --learning_rate 5e-5 --max_train_epochs 10 \
>     --optimizer_type adamw8bit  \
>     --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 \
>     --cpu_offload_checkpointing \
>     --resolution="1024,1024" --bucket_reso_steps=64 --bucket_no_upscale --min_bucket_reso=256 --max_bucket_reso=2048 --enable_bucket \
>     --train_data_dir=${input_path} --caption_extension=".txt"  \
>     --deepspeed --zero_stage=2 --full_bf16  --gradient_accumulation_steps=1 --cache_latents --offload_optimizer_device="cpu"

Regarding flux_train.py: even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error no matter how you tune the configurations kohya mentioned in the notes for flux_train.py, because some of those options only work in the single-GPU case.

terrificdm avatar Aug 29 '24 07:08 terrificdm

> Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

> I have four A100 40GB GPUs. Is it feasible to train the Flux model with multiple graphics cards? I've been running into OOM, and when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16".

> This is my complete command:

> accelerate launch  --gpu_ids 0,1,2 --mixed_precision bf16 --num_cpu_threads_per_process 3 flux_train.py \
>     --pretrained_model_name_or_path ${flux_model_path} --clip_l ${clip_l_path} --t5xxl ${t5xxl_path} --ae ${ae_path} --save_model_as safetensors \
>     --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
>     --output_dir ${output_path} --output_name flux_dev  --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1  \
>     --learning_rate 5e-5 --max_train_epochs 10 \
>     --optimizer_type adamw8bit  \
>     --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 \
>     --cpu_offload_checkpointing \
>     --resolution="1024,1024" --bucket_reso_steps=64 --bucket_no_upscale --min_bucket_reso=256 --max_bucket_reso=2048 --enable_bucket \
>     --train_data_dir=${input_path} --caption_extension=".txt"  \
>     --deepspeed --zero_stage=2 --full_bf16  --gradient_accumulation_steps=1 --cache_latents --offload_optimizer_device="cpu"

> Regarding flux_train.py: even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error no matter how you tune the configurations kohya mentioned in the notes for flux_train.py, because some of those options only work in the single-GPU case.

I also hit this problem, but when I use the same command with sdxl_train.py it works fine, so I think the Flux script has some issue with DeepSpeed.

Ethan-niu avatar Aug 29 '24 09:08 Ethan-niu

> The error was not caused by DDP multi-GPU training, but by DeepSpeed... Plain DDP multi-GPU training was fine for Flux LoRA training, but as soon as you installed DeepSpeed and enabled it in your training script by adding --deepspeed --zero_stage=2, it would throw "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision set to bf16 for both accelerate and the scripts, the error persisted. Maybe @BootsofLagrangian would like to take a look. Thanks.

I also hit this problem, but when I use the same command with sdxl_train.py it works fine, so I think the Flux script has some issue with DeepSpeed.

Ethan-niu avatar Aug 29 '24 09:08 Ethan-niu

> Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

> I have four A100 40GB GPUs. Is it feasible to train the Flux model with multiple graphics cards? I've been running into OOM, and when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16".

> In sd-scripts, autocast is enabled. But for a reason I don't yet understand, autocast does not work for the DeepSpeed-wrapped model. I think fixing it will take time. However, another ZeRO implementation, FSDP, which is not implemented in sd-scripts, does work.

> How should I modify my command to run flux_train.py?

DDP or DeepSpeed? If you want to run flux_train.py with DDP, you have to fix some code like this. DeepSpeed support is still a work in progress.

BootsofLagrangian avatar Aug 29 '24 11:08 BootsofLagrangian
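
The linked change is not reproduced here, but for anyone hitting attribute errors under plain DDP, a typical multi-GPU adjustment looks like the sketch below: once accelerate wraps the model in DistributedDataParallel, custom methods are no longer reachable on the wrapper and have to be called on the unwrapped module. TinyModel and enable_gradient_checkpointing are placeholders, not sd-scripts APIs.

```python
# Generic illustration of a common DDP-related fix (not the actual linked change).
import torch
from accelerate import Accelerator

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def enable_gradient_checkpointing(self):  # stand-in for a model-specific helper
        print("gradient checkpointing enabled")

accelerator = Accelerator()
model = accelerator.prepare(TinyModel())

# Calling model.enable_gradient_checkpointing() directly would fail once the model
# is wrapped in DDP; unwrapping works in both single- and multi-GPU runs.
accelerator.unwrap_model(model).enable_gradient_checkpointing()
```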

DDP seems to consume a lot of memory. I guess it's because the number of parameters is so large that the synchronization overhead is large, but I don't know why, so if anyone knows, please let me know.

> Regarding flux_train.py: even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error no matter how you tune the configurations kohya mentioned in the notes for flux_train.py, because some of those options only work in the single-GPU case.

Could you try reducing the resolution to about 512x512?

kohya-ss avatar Aug 29 '24 12:08 kohya-ss

> DDP or DeepSpeed? If you want to run flux_train.py with DDP, you have to fix some code like this. DeepSpeed support is still a work in progress.

I think this issue is solved.

kohya-ss avatar Aug 29 '24 12:08 kohya-ss

> DDP seems to consume a lot of memory. I guess it's because the number of parameters is so large that the synchronization overhead is large, but I don't know why, so if anyone knows, please let me know.

> Regarding flux_train.py: even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error no matter how you tune the configurations kohya mentioned in the notes for flux_train.py, because some of those options only work in the single-GPU case.

> Could you try reducing the resolution to about 512x512?

I tried resolution 512 on an A100 80GB, but it still OOMs.

Ethan-niu avatar Aug 30 '24 02:08 Ethan-niu

The following options might work: --sdpa --optimizer_type adafactor --optimizer_args relative_step=False scale_parameter=False warmup_init=False --full_bf16 --ddp_gradient_as_bucket --ddp_static_graph --cpu_offload_checkpointing --fused_backward_pass

kohya-ss avatar Aug 30 '24 12:08 kohya-ss
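
For context on the two DDP flags in that list: they appear to correspond to torch DDP's gradient_as_bucket_view and static_graph options, which accelerate exposes through DistributedDataParallelKwargs. The snippet below is a sketch of that underlying mechanism under this assumption, not the sd-scripts implementation.

```python
# Sketch: how the DDP-related options are typically passed through accelerate.
# static_graph requires a reasonably recent torch/accelerate.
from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

ddp_kwargs = DistributedDataParallelKwargs(
    gradient_as_bucket_view=True,  # let gradients alias the communication buckets (saves memory)
    static_graph=True,             # assume a fixed graph, enabling some DDP optimizations
)
accelerator = Accelerator(mixed_precision="bf16", kwargs_handlers=[ddp_kwargs])
```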

> The following options might work: --sdpa --optimizer_type adafactor --optimizer_args relative_step=False scale_parameter=False warmup_init=False --full_bf16 --ddp_gradient_as_bucket --ddp_static_graph --cpu_offload_checkpointing --fused_backward_pass

Thank you very much, but I'd like to know when DeepSpeed can be used to fine-tune Flux; with DDP only small batch sizes and resolutions are possible.

Ethan-niu avatar Sep 02 '24 08:09 Ethan-niu

I'm not familiar with DeepSpeed so it will probably take a while.

kohya-ss avatar Sep 02 '24 14:09 kohya-ss

> I'm not familiar with DeepSpeed so it will probably take a while.

Thank you. I found that training with 1 GPU uses only 40 GB of VRAM, but with the same config and two GPUs it uses 80 GB. Why?

Ethan-niu avatar Sep 03 '24 07:09 Ethan-niu

> Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

> I have four A100 40GB GPUs. Is it feasible to train the Flux model with multiple graphics cards? I've been running into OOM, and when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16".

> In sd-scripts, autocast is enabled. But for a reason I don't yet understand, autocast does not work for the DeepSpeed-wrapped model. I think fixing it will take time. However, another ZeRO implementation, FSDP, which is not implemented in sd-scripts, does work.

I trained SDXL with DeepSpeed using sd-scripts and it was fine, but Flux is not, so I think the Flux code may have some bugs?

Ethan-niu avatar Sep 05 '24 03:09 Ethan-niu

@BootsofLagrangian Hi there, hope I can reach out to you. I also get this dtype error when training a Flux LoRA with DeepSpeed multi-GPU. Do you maybe have any update on what it might be? Thank you for your time!

RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16

kunibald413 avatar Nov 08 '24 14:11 kunibald413

> @BootsofLagrangian Hi there, hope I can reach out to you. I also get this dtype error when training a Flux LoRA with DeepSpeed multi-GPU. Do you maybe have any update on what it might be? Thank you for your time!

> RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16

This is quite a tricky problem. It might be caused by the model inputs (probably the cached tokens, which are Float), but autocast should handle this inside the context manager. Sorry about this.

BootsofLagrangian avatar Nov 10 '24 06:11 BootsofLagrangian
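
To narrow down which cached input is still float32, a small diagnostic helper like the one below can be dropped in just before the forward pass to print the dtype of every tensor going into the model. The names in the usage comment are illustrative, not sd-scripts variables.

```python
# Diagnostic sketch: report shapes and dtypes of all tensors in a (possibly nested) input.
import torch

def report_dtypes(name: str, obj) -> None:
    if torch.is_tensor(obj):
        print(f"{name}: {tuple(obj.shape)} {obj.dtype}")
    elif isinstance(obj, (list, tuple)):
        for i, value in enumerate(obj):
            report_dtypes(f"{name}[{i}]", value)
    elif isinstance(obj, dict):
        for key, value in obj.items():
            report_dtypes(f"{name}.{key}", value)

# Usage just before the forward pass (illustrative names):
# report_dtypes("latents", noisy_model_input)
# report_dtypes("text_encoder_conds", text_encoder_conds)
```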

@terrificdm With an RTX 3090 (24GB) and image resolution 1024, multi-GPU Flux fine-tuning goes OOM. Have you ever hit the same problem? Thanks a lot! My config script is: (screenshot attached)

wanglaofei avatar Nov 18 '24 07:11 wanglaofei

> @terrificdm With an RTX 3090 (24GB) and image resolution 1024, multi-GPU Flux fine-tuning goes OOM. Have you ever hit the same problem? Thanks a lot! My config script is: (screenshot attached)

I have the same issue; I think this is related to GPU memory size?

yurujaja avatar Nov 22 '24 16:11 yurujaja

DeepSpeed mode is still not working. When using DeepSpeed with two GPUs, this occurs: (screenshot attached). But using DDP directly with two GPUs, it works.

wanglaofei avatar Jan 08 '25 09:01 wanglaofei

> I'm not familiar with DeepSpeed so it will probably take a while.
>
> Thank you. I found that training with 1 GPU uses only 40 GB of VRAM, but with the same config and two GPUs it uses 80 GB. Why?

May I ask if you have solved this problem? I have the same problem.

zideliu avatar Jan 17 '25 02:01 zideliu

> @terrificdm With an RTX 3090 (24GB) and image resolution 1024, multi-GPU Flux fine-tuning goes OOM. Have you ever hit the same problem? Thanks a lot! My config script is: (screenshot attached)

> I have the same issue; I think this is related to GPU memory size?

FLUX.1 has about 11B parameters.

Technically, full fine-tuning requires 11B × 2 bytes (bf16 weights) + optimizer state (Adafactor: factored, roughly √N-scale; Adam: 2 × 11B × 4 bytes in fp32) + gradients at 11B × 2 or 4 bytes, which is far more than 24 GB (a single RTX 3090, with DDP).

When I tested DeepSpeed, I used more GPUs (4x, 8x).
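
A rough back-of-the-envelope version of that estimate, assuming bf16 weights and gradients plus Adam-style fp32 moments (Adafactor's factored state would be much smaller):

```python
# Approximate per-replica memory for full fine-tuning of an ~11B-parameter model
# with plain DDP (no sharding, no offload).
N = 11e9
GiB = 1024 ** 3

weights_bf16 = N * 2      # bf16 weights
grads_bf16 = N * 2        # bf16 gradients (4 bytes each if kept in fp32)
adam_fp32 = 2 * N * 4     # Adam keeps two fp32 moments per parameter

total = weights_bf16 + grads_bf16 + adam_fp32
print(f"~{total / GiB:.0f} GiB per GPU")  # ~123 GiB, far beyond a 24 GB RTX 3090
```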

> DeepSpeed mode is still not working. When using DeepSpeed with two GPUs, this occurs: (screenshot attached). But using DDP directly with two GPUs, it works.

Unfortunately, I don't have much time. Directly casting the latents to the model's dtype (bf16) might solve this, but I'm not sure.

> I'm not familiar with DeepSpeed so it will probably take a while.
>
> Thank you. I found that training with 1 GPU uses only 40 GB of VRAM, but with the same config and two GPUs it uses 80 GB. Why?

> May I ask if you have solved this problem? I have the same problem.

I think this might be caused by gradient reduction, i.e. communication between GPUs. In a multi-GPU setting, each GPU has to compute gradients and share them across all GPUs.

BootsofLagrangian avatar Jan 20 '25 05:01 BootsofLagrangian