
Multi GPU training of FLUX has some bugs

Open sx0404 opened this issue 1 year ago • 28 comments

Multi GPU training of FLUX: is the flux train script not supported yet?

sx0404 avatar Aug 20 '24 07:08 sx0404

It is supported on the sd-3 branch. If you run into errors, please attach the logs.

BootsofLagrangian avatar Aug 20 '24 09:08 BootsofLagrangian

It is supported on the sd-3 branch. If you run into errors, please attach the logs.

(screenshot: Snipaste_2024-08-20_17-45-49)

This exception now occurs during prepare, but when I modify it to match the SDXL script, other problems appear at the dataloader.

sx0404 avatar Aug 20 '24 09:08 sx0404

Unfortunately multi GPU training of FLUX has not been tested yet. --split_mode doesn't seem to work with multi GPU training.

kohya-ss avatar Aug 20 '24 09:08 kohya-ss

Unfortunately multi GPU training of FLUX has not been tested yet. --split_mode doesn't seem to work with multi GPU training.

Single-card training is currently too slow for FLUX, especially for fine-tuning at the scale of Pony or animation models. Prioritizing fixes for the multi-GPU problems would benefit the open-source community.

sx0404 avatar Aug 20 '24 10:08 sx0404

Unfortunately multi GPU training of FLUX has not been tested yet. --split_mode doesn't seem to work with multi GPU training.

Single-card training is currently too slow for FLUX, especially for fine-tuning at the scale of Pony or animation models. Prioritizing fixes for the multi-GPU problems would benefit the open-source community.

In my toy scripts, training FLUX with FSDP through accelerate works.

I used the accelerate configuration below (fsdp_config.yaml) plus a small code fix.

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
  fsdp_min_num_params: 100000000
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

In flux_train.py, line 182: if you use cached latents or cached text encoder outputs, change train_dataset_group.new_cache_latents(ae, accelerator.is_main_process) to train_dataset_group.new_cache_latents(ae, True).
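A minimal sketch of that edit in place (the comments are mine, not an excerpt of flux_train.py):

# Before: only the main process populated the cache, so other ranks had nothing to load.
# train_dataset_group.new_cache_latents(ae, accelerator.is_main_process)

# After: every rank runs the caching path, so cached latents (and cached text encoder
# outputs) are available on all processes when training with FSDP.
train_dataset_group.new_cache_latents(ae, True)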

And a sample launch script:

accelerate launch --config_file=$ACCELERATE_CONFIG \
    --num_processes=$DEVICE_COUNT --num_machines=$NUM_MACHINES --mixed_precision=$MIXED_PRECISION \
    --main_process_ip=$MAIN_PROCESS_IP --main_process_port=$MAIN_PROCESS_PORT \
    --num_cpu_threads_per_process=2 \
    flux_train.py --config_file=$CONFIG_FILE

Note: saving does not work.

BootsofLagrangian avatar Aug 21 '24 07:08 BootsofLagrangian

Updated the sd3 branch. Multi-GPU training should now work.

kohya-ss avatar Aug 22 '24 03:08 kohya-ss

Updated the sd3 branch. Multi-GPU training should now work.

Kohya you are a God, thank you for the multi-GPU update!!

b-7777777 avatar Aug 22 '24 03:08 b-7777777

In my test, fine-tuning hit OOM at 1024x1024 resolution in a 2x48GB VRAM environment but worked at 512x512. If you know how to reduce memory usage, please let me know.

kohya-ss avatar Aug 22 '24 03:08 kohya-ss

Updated the sd3 branch. Multi-GPU training should now work.

accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train.py \
    --pretrained_model_name_or_path /workspace/pretrain_model/flux1-dev.safetensors \
    --clip_l /workspace/pretrain_model/clip_l.safetensors --t5xxl /workspace/pretrain_model/t5xxl_fp16.safetensors \
    --ae /workspace/pretrain_model/ae_dev.safetensors --save_model_as safetensors \
    --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 \
    --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
    --output_dir /workspace/sa_poc --output_name sa --learning_rate 7.5e-6 --max_train_epochs 10 \
    --sdpa --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1 \
    --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
    --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 \
    --blockwise_fused_optimizers --double_blocks_to_swap 6 --cpu_offload_checkpointing \
    --train_data_dir /workspace/sa_poc/sa --in_json /workspace/sa_poc/lat_small.json \
    --resolution "1024,1024" --train_batch_size 42

I ran this command but got an error:

(screenshot: Snipaste_2024-08-22_16-04-10)

sx0404 avatar Aug 22 '24 08:08 sx0404

Updated the sd3 branch. Multi-GPU training should now work.

When I use the double_blocks-related options, an error is reported. After removing them it can run, but GPU utilization is unstable, and with a batch_size of 4 the memory usage is already very high.

(screenshot: Snipaste_2024-08-22_16-12-25)

My command:

accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train.py \
    --pretrained_model_name_or_path /workspace/pretrain_model/flux1-dev.safetensors \
    --clip_l /workspace/pretrain_model/clip_l.safetensors --t5xxl /workspace/pretrain_model/t5xxl_fp16.safetensors \
    --ae /workspace/pretrain_model/ae_dev.safetensors --save_model_as safetensors \
    --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 \
    --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
    --output_dir /workspace/sa_poc --output_name sa --learning_rate 7.5e-6 --max_train_epochs 10 \
    --sdpa --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1 \
    --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
    --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 \
    --cpu_offload_checkpointing \
    --train_data_dir /workspace/sa_poc/sa --in_json /workspace/sa_poc/lat_small.json \
    --resolution "1024,1024" --train_batch_size 4

Or can you give me a command to test with?

sx0404 avatar Aug 22 '24 08:08 sx0404

--double_blocks_to_swap and --single_blocks_to_swap cannot be used with multi-GPU training.

--cpu_offload_checkpointing reduces the memory usage, but the GPU utilization seems to be reduced. You may want to remove that option (and decrease the batch size if needed).

kohya-ss avatar Aug 22 '24 09:08 kohya-ss

--double_blocks_to_swap and --single_blocks_to_swap cannot be used with multi-GPU training.

--cpu_offload_checkpointing reduces the memory usage, but the GPU utilization seems to be reduced. You may want to remove that option (and decrease the batch size if needed).

Thank you for your answer. fp16 can run, but the loss is NaN; accelerate is also set to fp16.

(screenshot: Snipaste_2024-08-22_18-02-49)

sx0404 avatar Aug 22 '24 10:08 sx0404

fp16 doesn't seem to be stable. I recommend bf16.

kohya-ss avatar Aug 22 '24 10:08 kohya-ss

Hi @kohya-ss, can I use this feature on both Windows and Linux now? Have you finished developing it?

phucbienvan avatar Aug 28 '24 01:08 phucbienvan

Has anyone managed to get it working properly? I have 4 GPUs, but it still only runs on the first GPU. (screenshot)

phucbienvan avatar Aug 28 '24 07:08 phucbienvan

Has anyone managed to get it working properly? I have 4 GPUs, but it still only runs on the first GPU. (screenshot)

Is the trainer in the caching phase?

BootsofLagrangian avatar Aug 28 '24 08:08 BootsofLagrangian

Has anyone managed to get it working properly? I have 4 GPUs, but it still only runs on the first GPU. (screenshot)

Is the trainer in the caching phase?

I don't think so, but is there a way to check this? @BootsofLagrangian

phucbienvan avatar Aug 28 '24 08:08 phucbienvan

Has anyone managed to get it working properly? I have 4 GPUs, but it still only runs on the first GPU. (screenshot)

Is the trainer in the caching phase?

I don't think so, but is there a way to check this? @BootsofLagrangian

Check the terminal where you ran the script; you can see whether the trainer is in the caching phase or not.

Also check that your accelerate configuration (in ~/.cache/huggingface/accelerate/default_config.yaml) sets distributed_type to MULTI_GPU and num_processes to the number of GPUs.
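If you want to check that programmatically, here is a minimal sketch (a hypothetical helper, not part of sd-scripts; it assumes PyYAML is available, which accelerate already depends on):

import os
import yaml  # PyYAML, installed as a dependency of accelerate

# Read the default accelerate config and print the two fields that matter for multi-GPU runs.
config_path = os.path.expanduser("~/.cache/huggingface/accelerate/default_config.yaml")
with open(config_path) as f:
    cfg = yaml.safe_load(f)

print("distributed_type:", cfg.get("distributed_type"))  # expect MULTI_GPU
print("num_processes:", cfg.get("num_processes"))        # expect your GPU count, e.g. 4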

BootsofLagrangian avatar Aug 28 '24 08:08 BootsofLagrangian

hi @BootsofLagrangian I am using Windows. Could you please share your accelerate config? And an example of the run script for training? Thank you so much!

phucbienvan avatar Aug 28 '24 08:08 phucbienvan

hi @BootsofLagrangian I am using Windows. Could you please share your accelerate config? And an example of the run script for training? Thank you so much!

Here is an accelerate configuration for 4 GPUs with DDP and bf16 mixed precision.

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false

If you want to use it more flexibly, save the above content as ddp.yaml, and then pass it as an argument to accelerate launch like so: accelerate launch --config_file='ddp.yaml' flux_train_network.py --config_file=SCRIPT_FOR_CONFIG.yaml.

And here is an example config for training FLUX with LoRA (if you want to fully fine-tune FLUX, use flux_train.py instead of flux_train_network.py):

pretrained_model_name_or_path = "/path/to/models/sd-3/Flux.1-dev/flux1-dev.safetensors"
clip_l = "/path/to/models/sd-3/Flux.1-dev/clip_l.safetensors"
t5xxl = "/path/to/models/sd-3/Flux.1-dev/t5xxl_fp16.safetensors"
ae = "/path/to/models/sd-3/Flux.1-dev/ae.safetensors"

timestep_sampling = "sigmoid"
model_prediction_type = "raw"
guidance_scale = 1.0
loss_type = "l2"

network_module = "networks.lora_flux"
network_dim = 16
network_alpha = 16

save_state = true
cache_info = true
sdpa = true
fp8_base = true
highvram = true
seed = 42

mixed_precision = "bf16"
save_precision = "bf16"
t5xxl_dtype = "fp16"

output_name = "YOUR_OUTPUT_NAME"
output_dir = "/path/to/output"
logging_dir = "/path/to/logging"
train_data_dir = "/path/to/dataset"
save_every_n_epochs = 1
save_every_n_steps = 100

cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true

caption_extension = ".txt"
caption_separator = ". "

vae_batch_size = 4

resolution = "1024,1024"
enable_bucket = true
bucket_no_upscale = true

train_batch_size = 4
gradient_accumulation_steps = 1
max_train_epochs = 10
max_data_loader_n_workers = 1
persistent_data_loader_workers = true
gradient_checkpointing = true

learning_rate = 3e-4
network_train_unet_only = true

max_grad_norm = 1.0
optimizer_type = "adamw8bit"
save_model_as = "safetensors"
optimizer_args = [ "weight_decay=1e-1", "betas=(0.95, 0.98)"]
lr_scheduler = "cosine"
lr_warmup_steps = 50

This is the example.yaml file. With it, run the following inside your virtual environment: accelerate launch --config_file=ddp.yaml flux_train_network.py --config_file=example.yaml. You should change the configuration to match your own paths and dataset.

BootsofLagrangian avatar Aug 28 '24 11:08 BootsofLagrangian

i will try Thank you so much! @BootsofLagrangian

phucbienvan avatar Aug 28 '24 14:08 phucbienvan

@BootsofLagrangian You mention swapping in flux_train.py for flux_train_network.py to do a multi-GPU full finetune.

With your config, I can run flux_train_network.py with no problems, but flux_train.py throws an out-of-memory error. Watching nvidia-smi, I can see there is no VRAM usage on the second GPU before this happens; it doesn't seem to be spreading shards across the two GPUs, and instead tries to load everything onto the first one.

I have tried this with and without the code change to flux_train.py that you suggested in an earlier comment.

I have two RTX 4090s, and I'm not sure if this is enough to finetune flux without quantization or block swapping, but it seems like it's not working as intended if it's throwing OOM errors without using the second GPU at all.

Were you able to get flux_train.py to work with the config you shared?

peteallen avatar Sep 08 '24 20:09 peteallen

--config_file=example.yam

I used this configuration and these parameters, but still got an error. This has been bothering me for a long time. How can I solve it?

Traceback (most recent call last):
  File "e:\pinokio\bin\miniconda\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "e:\pinokio\bin\miniconda\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "E:\pinokio\api\fluxgym.git\env\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
    args.func(args)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\accelerate\commands\launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\accelerate\commands\launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\run.py", line 910, in run
    elastic_launch(
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\launcher\api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\launcher\api.py", line 260, in launch_agent
    result = agent.run()
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 696, in run
    result = self._invoke_run(role)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 849, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 668, in _initialize_workers
    self._rendezvous(worker_group)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 500, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 67, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: use_libuv was requested but PyTorch was build without libuv support

jinwei1660 avatar Sep 21 '24 09:09 jinwei1660

accelerate launch --config_file=ddp.yaml flux_train_network.py --config_file=example.yaml

you are my hero, bro.

This script can run on a multi-GPU server, using about 18GB of memory.

zixuzhuang avatar Nov 01 '24 10:11 zixuzhuang

Has this been fixed? I'm getting the same kind of error with this command. I'm trying to run on a setup with 4x RTX4090s...


accelerate launch  --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train.py \
--pretrained_model_name_or_path ${transformer_path}  --clip_l $path_to_clip_l --t5xxl $path_to_t5 --ae $path_to_ae \
--save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 \
--seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
--output_dir path/to/output/dir --output_name output-name \
--learning_rate 5e-5 --max_train_epochs 4  --sdpa --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1 \
--optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
--lr_scheduler constant_with_warmup --max_grad_norm 0.0 \
--timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0 \
--fused_backward_pass  --fp8_base --full_bf16 --dataset_config dataset_1024_bs1.toml --blocks_to_swap 8
FLUX: Block swap enabled. Swapping 8 blocks, double blocks: 4, single blocks: 8.
                    INFO     use Adafactor optimizer | {'relative_step': False, 'scale_parameter': False, 'warmup_init': False}                                   train_util.py:4963
                    INFO     Loaded Flux: <All keys matched successfully>                                                                                          flux_utils.py:137
FLUX: Gradient checkpointing enabled. CPU offload: False
                    INFO     enable block swap: blocks_to_swap=8                                                                                                   flux_train.py:304
FLUX: Block swap enabled. Swapping 8 blocks, double blocks: 4, single blocks: 8.
number of trainable parameters: 11901408320
prepare optimizer, data loader etc.
                    INFO     use Adafactor optimizer | {'relative_step': False, 'scale_parameter': False, 'warmup_init': False}                                   train_util.py:4963
override steps. steps for 4 epochs is / 指定エポックまでのステップ数: 12
enable full bf16 training.
[rank3]: Traceback (most recent call last):
[rank3]:   File "/workspace/sd-scripts/flux_train.py", line 850, in <module>
[rank3]:     train(args)
[rank3]:   File "/workspace/sd-scripts/flux_train.py", line 462, in train
[rank3]:     flux = accelerator.prepare(flux, device_placement=[not is_swapping_blocks])
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1311, in prepare
[rank3]:     result = tuple(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1312, in <genexpr>
[rank3]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1188, in _prepare_one
[rank3]:     return self.prepare_model(obj, device_placement=device_placement)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1452, in prepare_model
[rank3]:     model = torch.nn.parallel.DistributedDataParallel(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 739, in __init__
[rank3]:     self._log_and_throw(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1127, in _log_and_throw
[rank3]:     raise err_type(err_msg)
[rank3]: ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [3],

frutiemax92 avatar May 20 '25 19:05 frutiemax92

Unfortunately --blocks_to_swap is not compatible with multi-GPU training. So I think FLUX.1 multi-GPU training requires 80GB of VRAM for each GPU...
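As a rough sanity check on that figure, here is a back-of-envelope sketch based on the parameter count printed in the log above (an estimate only; real usage also depends on optimizer state, activations, and fragmentation):

# Rough VRAM estimate per DDP rank for full FLUX.1 fine-tuning without block swap.
params = 11_901_408_320          # trainable parameters reported by flux_train.py above
gib = 1024 ** 3

weights_gib = params * 2 / gib   # bf16 weights, ~22.2 GiB
grads_gib = params * 2 / gib     # bf16 gradients (--full_bf16), ~22.2 GiB
print(f"weights ~{weights_gib:.1f} GiB + grads ~{grads_gib:.1f} GiB = ~{weights_gib + grads_gib:.1f} GiB")
# Optimizer state, activations (even with gradient checkpointing), the CUDA context and
# memory fragmentation come on top, which is why 48GB cards hit OOM at 1024x1024.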

kohya-ss avatar May 20 '25 22:05 kohya-ss

Unfortunately --blocks_to_swap is not compatible with multi-GPU training. So I think FLUX.1 multi-GPU training requires 80GB of VRAM for each GPU...

Why is it written that you can finetune Flux under 24GB VRAM? Is this for a single GPU setup?

frutiemax92 avatar May 21 '25 00:05 frutiemax92

That's right, a single GPU with 24GB VRAM can fine tune FLUX.1 with block swap.

kohya-ss avatar May 21 '25 08:05 kohya-ss