sd-scripts

Multi-GPU (8× A800 80GB) training with flux_train.py hits an OOM

Open godwenbin opened this issue 11 months ago • 4 comments

When I use flux_train.py to full-finetune a FLUX model with --optimizer_type adamw8bit and --batch_size 1, I always hit an OOM. However, single-GPU training works with --optimizer_type adamw8bit and --batch_size 8, using almost 79 GB of VRAM. How can I fix the multi-GPU OOM problem? Thanks for your reply.
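For context on why 80 GB can be tight even at batch size 1: under plain data-parallel training every rank holds a full copy of the weights, gradients, and optimizer state, so adding GPUs does not reduce per-GPU memory. A rough back-of-the-envelope sketch (the ~12B parameter count for FLUX.1 is an assumption, and activations are ignored entirely):

```python
GiB = 1024 ** 3

def estimate_static_vram_gib(n_params: float,
                             weight_bytes: int = 2,  # bf16 weights
                             grad_bytes: int = 2,    # bf16 gradients
                             optim_bytes: int = 2):  # AdamW8bit: two 1-byte states/param
    """Static (non-activation) memory one DDP rank must hold, in GiB."""
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / GiB

# Assumed ~12e9 parameters for the FLUX transformer (illustrative, not exact).
per_gpu = estimate_static_vram_gib(12e9)
print(f"~{per_gpu:.0f} GiB per GPU before activations")
```

Since this static footprint is replicated on every rank, only sharding schemes such as DeepSpeed ZeRO or FSDP actually divide it across the 8 GPUs.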

godwenbin avatar Feb 13 '25 02:02 godwenbin

Some features for VRAM optimization do not work very well with multi-GPU setups, such as --fused_backward_pass. However, 80GB of VRAM should at least work with a batch size of 1. Have you tried --gradient_checkpointing?
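A launch along these lines is what the suggestion amounts to (the accelerate arguments are standard, but the exact sd-scripts flag set here is a sketch, not a verified working config):

```shell
# Hypothetical multi-GPU launch combining the flags discussed in this thread.
CMD="accelerate launch --num_processes 8 --mixed_precision bf16 flux_train.py \
  --optimizer_type adamw8bit --train_batch_size 1 --gradient_checkpointing"
echo "$CMD"
```

Note that --fused_backward_pass is deliberately absent, per the caveat above about it not playing well with multi-GPU setups.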

Ice-YY avatar Feb 13 '25 12:02 Ice-YY

Some features for VRAM optimization do not work very well with multi-GPU setups, such as --fused_backward_pass. However, 80GB of VRAM should at least work with a batch size of 1. Have you tried --gradient_checkpointing?

Yes, I have used --gradient_checkpointing. I also tried different configurations to get it to work, but all of them failed.

godwenbin avatar Feb 14 '25 07:02 godwenbin

Please provide the full config.

dill-shower avatar Feb 18 '25 18:02 dill-shower

This lacks discussion and experiments (e.g., fine-tuning SDXL on 1× 3090 versus 4× 3090 is a different league). I have recently explored most of the options provided by both accelerate and this repo, and made my own attempt in PR #1950. The A100 doesn't support FP8; otherwise even more VRAM could be saved. You can push the DeepSpeed options further by setting ZeRO Stage-3 with FullyShardedDataParallel (FSDP). Good luck.
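For reference, an `accelerate` DeepSpeed config selecting ZeRO Stage-3 looks roughly like the fragment below (key names follow the accelerate config format; the offload choices and process count are assumptions for this 8-GPU case, not values taken from the PR):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3                     # shard weights, grads, and optimizer state
  offload_optimizer_device: cpu     # optional: trade speed for VRAM (assumed here)
  offload_param_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 8                    # one process per A800
```

Saving this via `accelerate config` (or passing it with `accelerate launch --config_file ...`) is how the ZeRO stage is selected; sd-scripts itself does not need to know about the sharding.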

If you need more features, build flux_train.py from flux_train_network.py by looking at my PR (I haven't explored yet how much code needs to be modified). However, it is only useful for large datasets (> 1M).

6DammK9 avatar Feb 23 '25 23:02 6DammK9