Multi-GPU training of FLUX has some bugs
Multi-GPU training of FLUX: is the FLUX training script not supported yet?
It is supported on the sd3 branch. If you have error logs, please attach them.
Currently an exception occurs during prepare, but when I modify it to match the SDXL script, there are other problems with the dataloader.
Unfortunately multi GPU training of FLUX has not been tested yet. --split_mode doesn't seem to work with multi GPU training.
The current single-GPU training is indeed too slow for FLUX, especially for fine-tuning at the scale of Pony or anime models. Prioritizing fixes for the multi-GPU problems would benefit the open-source community.
In my toy scripts, training FLUX with FSDP via accelerate works.
I used the following accelerate configuration (fsdp_config.yaml) and a small code fix:
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
  fsdp_min_num_params: 100000000
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
In flux_train.py, line 182, change
from
train_dataset_group.new_cache_latents(ae, accelerator.is_main_process)
to
train_dataset_group.new_cache_latents(ae, True)
(This is needed if you cache latents or cache the text encoder outputs.)
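For context, here is a small, self-contained sketch of the general multi-process caching pattern with accelerate (an assumed illustration only, not the actual sd-scripts code; the function name below is a placeholder): either every rank builds the cache itself, or rank 0 writes it to disk first and the other ranks read it afterwards.
# Minimal sketch of multi-process caching with HF accelerate (assumption for
# illustration; cache_latents_to_disk is a placeholder, not an sd-scripts API).
from accelerate import Accelerator

accelerator = Accelerator()

def cache_latents_to_disk():
    # placeholder for the real work (e.g. encoding images with the AE and
    # saving the latents to disk)
    pass

with accelerator.main_process_first():
    # rank 0 enters first and writes the cache; the other ranks wait, then
    # enter and find the files already on disk
    cache_latents_to_disk()

accelerator.wait_for_everyone()  # make sure caching is done before training starts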
And here is a sample launch script:
accelerate launch --config_file=$ACCELERATE_CONFIG \
--num_processes=$DEVICE_COUNT --num_machines=$NUM_MACHINES --mixed_precision=$MIXED_PRECISION \
--main_process_ip=$MAIN_PROCESS_IP --main_process_port=$MAIN_PROCESS_PORT \
--num_cpu_threads_per_process=2 \
flux_train.py --config_file=$CONFIG_FILE
cf.) Saving does not work.
Updated the sd3 branch. Multi-GPU training should now work.
Kohya you are a God, thank you for the multi-GPU update!!
In my test, fine tuning caused OOM at 1024x1024 resolution in a 48GB VRAM*2 environment and worked at 512x512. If you know how to reduce memory usage, please let me know.
accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train.py --pretrained_model_name_or_path /workspace/pretrain_model/flux1-dev.safetensors --clip_l /workspace/pretrain_model/clip_l.safetensors --t5xxl /workspace/pretrain_model/t5xxl_fp16.safetensors --ae /workspace/pretrain_model/ae_dev.safetensors --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 --output_dir /workspace/sa_poc --output_name sa --learning_rate 7.5e-6 --max_train_epochs 10 --sdpa --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1 --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 --blockwise_fused_optimizers --double_blocks_to_swap 6 --cpu_offload_checkpointing --train_data_dir /workspace/sa_poc/sa --in_json /workspace/sa_poc/lat_small.json --resolution "1024,1024" --train_batch_size 42
I ran this command but got an error.
When I use the double_blocks options, an error is reported. After removing them it can run, but GPU utilization is unstable, and with a batch size of 4 the memory usage is already very high.
My command:
accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train.py --pretrained_model_name_or_path /workspace/pretrain_model/flux1-dev.safetensors --clip_l /workspace/pretrain_model/clip_l.safetensors --t5xxl /workspace/pretrain_model/t5xxl_fp16.safetensors --ae /workspace/pretrain_model/ae_dev.safetensors --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 --output_dir /workspace/sa_poc --output_name sa --learning_rate 7.5e-6 --max_train_epochs 10 --sdpa --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1 --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 --cpu_offload_checkpointing --train_data_dir /workspace/sa_poc/sa --in_json /workspace/sa_poc/lat_small.json --resolution "1024,1024" --train_batch_size 4
Or could you give me a command to test with?
--double_blocks_to_swap and --single_blocks_to_swap cannot be used with multi-GPU training.
--cpu_offload_checkpointing reduces the memory usage, but the GPU utilization seems to be reduced. You may want to remove that option (and decrease the batch size if needed).
Thank you for your answer. fp16 can run, but the loss is NaN; accelerate is also set to fp16.
fp16 doesn't seem to be stable. I recommend bf16.
Hi @kohya-ss, can I use this feature on both Windows and Linux now? Have you finished developing it?
Has anyone managed to get it working properly? I have 4 GPUs, but it still only runs on the first GPU.
Is the trainer stuck in the caching process?
I don't think so, but is there a way to check this? @BootsofLagrangian
Look at the terminal where you run the script; you can see whether the trainer is in the caching process or not.
Also check that your accelerate configuration has distributed_type set to MULTI_GPU and num_processes set to the number of GPUs, in ~/.cache/huggingface/accelerate/default_config.yaml.
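For example, a quick way to check those two fields (a small sketch using PyYAML; the path assumes the default accelerate config location):
# Print the two fields that matter for multi-GPU runs from the default
# accelerate config (sketch; adjust the path if you use a custom config file).
import os
import yaml  # pip install pyyaml

cfg_path = os.path.expanduser("~/.cache/huggingface/accelerate/default_config.yaml")
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

print("distributed_type:", cfg.get("distributed_type"))  # expect MULTI_GPU
print("num_processes:", cfg.get("num_processes"))        # expect the number of GPUs, e.g. 4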
Hi @BootsofLagrangian, I am using Windows. Could you please share your accelerate config? And an example of the run script for training? Thank you so much!
Here is an accelerate configuration for 4 GPUs with DDP and bf16 mixed precision.
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
If you want to use it more flexibly, save the above content as ddp.yaml, and then pass it as an argument to accelerate launch like so: accelerate launch --config_file='ddp.yaml' flux_train_network.py --config_file=SCRIPT_FOR_CONFIG.yaml.
And here is an example config and script for training FLUX with LoRA (if you want full fine-tuning of FLUX, use flux_train.py instead of flux_train_network.py).
pretrained_model_name_or_path = "/path/to/models/sd-3/Flux.1-dev/flux1-dev.safetensors"
clip_l = "/path/to/models/sd-3/Flux.1-dev/clip_l.safetensors"
t5xxl = "/path/to/models/sd-3/Flux.1-dev/t5xxl_fp16.safetensors"
ae = "/path/to/models/sd-3/Flux.1-dev/ae.safetensors"
timestep_sampling = "sigmoid"
model_prediction_type = "raw"
guidance_scale = 1.0
loss_type = "l2"
network_module = "networks.lora_flux"
network_dim = 16
network_alpha = 16
save_state = true
cache_info = true
sdpa = true
fp8_base = true
highvram = true
seed = 42
mixed_precision = "bf16"
save_precision = "bf16"
t5xxl_dtype = "fp16"
output_name = "YOUR_OUTPUT_NAME"
output_dir = "/path/to/output"
logging_dir = "/path/to/logging"
train_data_dir = "/path/to/dataset"
save_every_n_epochs = 1
save_every_n_steps = 100
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
caption_extension = ".txt"
caption_separator = ". "
vae_batch_size = 4
resolution = "1024,1024"
enable_bucket = true
bucket_no_upscale = true
train_batch_size = 4
gradient_accumulation_steps = 1
max_train_epochs = 10
max_data_loader_n_workers = 1
persistent_data_loader_workers = true
gradient_checkpointing = true
learning_rate = 3e-4
network_train_unet_only = true
max_grad_norm = 1.0
optimizer_type = "adamw8bit"
save_model_as = "safetensors"
optimizer_args = [ "weight_decay=1e-1", "betas=(0.95, 0.98)"]
lr_scheduler = "cosine"
lr_warmup_steps = 50
This is the example config file (example.yaml). With this file, run in your virtual environment: accelerate launch --config_file=ddp.yaml flux_train_network.py --config_file=example.yaml
You should adjust the configuration for your own paths and dataset.
I will try it. Thank you so much! @BootsofLagrangian
@BootsofLagrangian You mention using flux_train.py instead of flux_train_network.py to do a multi-GPU full fine-tune.
With your config, I can run flux_train_network.py with no problems, but flux_train.py throws an out-of-memory error. Watching nvidia-smi, I can see that there is no VRAM usage on the second GPU before this happens; it doesn't seem to be spreading shards across the two GPUs, and instead tries to load everything onto the first one.
I have tried this with and without the code change to flux_train.py that you suggested in an earlier comment.
I have two RTX 4090s, and I'm not sure if this is enough to finetune flux without quantization or block swapping, but it seems like it's not working as intended if it's throwing OOM errors without using the second GPU at all.
Were you able to get flux_train.py to work with the config you shared?
I used this configuration and parameters, but still got an error. This has been bothering me for a long time. How can I solve it?
Traceback (most recent call last):
File "e:\pinokio\bin\miniconda\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "e:\pinokio\bin\miniconda\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "E:\pinokio\api\fluxgym.git\env\Scripts\accelerate.exe_main.py", line 7, in
accelerate launch --config_file=ddp.yaml flux_train_network.py --config_file=example.yaml
You are my hero, bro.
This script can run on a multi-GPU server, using about 18 GB of memory.
Has this been fixed? I'm getting the same kind of error with this command. I'm trying to run on a setup with 4x RTX4090s...
accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train.py \
--pretrained_model_name_or_path ${transformer_path} --clip_l $path_to_clip_l --t5xxl $path_to_t5 --ae $path_to_ae \
--save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 \
--seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
--output_dir path/to/output/dir --output_name output-name \
--learning_rate 5e-5 --max_train_epochs 4 --sdpa --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1 \
--optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
--lr_scheduler constant_with_warmup --max_grad_norm 0.0 \
--timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0 \
--fused_backward_pass --fp8_base --full_bf16 --dataset_config dataset_1024_bs1.toml --blocks_to_swap 8
FLUX: Block swap enabled. Swapping 8 blocks, double blocks: 4, single blocks: 8.
INFO use Adafactor optimizer | {'relative_step': False, 'scale_parameter': False, 'warmup_init': False} train_util.py:4963
INFO Loaded Flux: <All keys matched successfully> flux_utils.py:137
FLUX: Gradient checkpointing enabled. CPU offload: False
INFO enable block swap: blocks_to_swap=8 flux_train.py:304
FLUX: Block swap enabled. Swapping 8 blocks, double blocks: 4, single blocks: 8.
number of trainable parameters: 11901408320
prepare optimizer, data loader etc.
INFO use Adafactor optimizer | {'relative_step': False, 'scale_parameter': False, 'warmup_init': False} train_util.py:4963
override steps. steps for 4 epochs is / 指定エポックまでのステップ数: 12
enable full bf16 training.
[rank3]: Traceback (most recent call last):
[rank3]: File "/workspace/sd-scripts/flux_train.py", line 850, in <module>
[rank3]: train(args)
[rank3]: File "/workspace/sd-scripts/flux_train.py", line 462, in train
[rank3]: flux = accelerator.prepare(flux, device_placement=[not is_swapping_blocks])
[rank3]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1311, in prepare
[rank3]: result = tuple(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1312, in <genexpr>
[rank3]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1188, in _prepare_one
[rank3]: return self.prepare_model(obj, device_placement=device_placement)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1452, in prepare_model
[rank3]: model = torch.nn.parallel.DistributedDataParallel(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 739, in __init__
[rank3]: self._log_and_throw(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1127, in _log_and_throw
[rank3]: raise err_type(err_msg)
[rank3]: ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [3],
Unfortunately --blocks_to_swap is not compatible with multiple GPU training. So I think FLUX.1 multiple GPU training requires 80GB VRAM for each GPU...
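For reference, a minimal, self-contained sketch of what the error above usually indicates (this is an assumption about the cause, not sd-scripts code): block swap leaves some blocks on the CPU, so the module's parameters span both cuda and cpu, and DDP with device_ids refuses to wrap such a mixed-device module.
# Sketch of a mixed-device module like the one DDP rejects above (assumption
# about the failure mode; the toy model below is for illustration only).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
if torch.cuda.is_available():
    model[0].to("cuda:0")  # this block stays on the GPU
model[1].to("cpu")         # simulates a swapped-out block

print({p.device for p in model.parameters()})
# With parameters on both cuda and cpu, wrapping the model with
# torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])
# raises the ValueError shown in the traceback.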
Why is it written that you can finetune Flux under 24GB VRAM? Is this for a single GPU setup?
That's right, a single GPU with 24GB VRAM can fine tune FLUX.1 with block swap.
