
Slow iteration/seconds on flux dev lora training

Open asinghan2603 opened this issue 10 months ago • 6 comments

Hi,

I am doing LoRA training of flux-dev on the sd3 branch. It's taking 12-18 seconds per iteration, and increasing the num_workers count is not helping the training speed either; usually it's 3-4 seconds per iteration. Can someone help me understand this, and is there a way to improve the speed? I am using an NVIDIA A100-SXM4-80GB GPU.

@kohya-ss

[screenshot attached]

asinghan2603 avatar Mar 31 '25 22:03 asinghan2603
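To put the reported rates in perspective (reading them as seconds per iteration), here is a quick back-of-the-envelope sketch of what the slowdown means in wall-clock time. The 1000-step count is an illustrative assumption, not a figure from the issue:

```python
# Rough arithmetic: total wall time at different per-step costs.
# The step count (1000) is an illustrative assumption, not from the issue.
steps = 1000

slow_s_per_it = 15.0   # midpoint of the reported 12-18 s/it
normal_s_per_it = 3.5  # midpoint of the usual 3-4 s/it

slow_hours = steps * slow_s_per_it / 3600
normal_hours = steps * normal_s_per_it / 3600

print(f"slow run:   {slow_hours:.1f} h")    # ~4.2 h
print(f"normal run: {normal_hours:.1f} h")  # ~1.0 h
print(f"slowdown:   {slow_s_per_it / normal_s_per_it:.1f}x")  # ~4.3x
```

In other words, a run that would normally finish in about an hour stretches to roughly four, which is why pinning down the cause (CPU fallback, config, I/O) matters before launching long jobs.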

Looks like it might be running on the CPU? I would check your accelerate config.

rockerBOO avatar Apr 01 '25 02:04 rockerBOO

Hi @rockerBOO, this is how my config looks; the config targets the GPUs:

{
  "compute_environment": "LOCAL_MACHINE",
  "debug": false,
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "enable_cpu_affinity": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 8,
  "rdzv_backend": "static",
  "same_network": false,
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false
}
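One way to triage a config like this is to scan it for settings that commonly correlate with slow training. The checks below are assumptions about what typically matters (GPU execution, bf16 mixed precision on A100-class hardware), not an official checklist from sd-scripts or Accelerate:

```python
# Sketch: sanity-check an accelerate config for common slow-training causes.
# The specific checks are assumptions, not an official sd-scripts checklist.
import json

config_json = '''{"compute_environment": "LOCAL_MACHINE",
"distributed_type": "MULTI_GPU", "mixed_precision": "no",
"num_processes": 8, "use_cpu": false}'''

cfg = json.loads(config_json)
warnings = []
if cfg.get("use_cpu"):
    warnings.append("use_cpu is true: training will run on the CPU")
if cfg.get("mixed_precision", "no") == "no":
    warnings.append("mixed_precision is 'no': consider 'bf16' on A100-class GPUs")

for w in warnings:
    print("WARN:", w)
```

For the config posted above, this flags only the `mixed_precision: "no"` setting; full precision alone would not usually explain a 4x slowdown, but it is cheap to rule out.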

asinghan2603 avatar Apr 01 '25 04:04 asinghan2603

Well, it's hard to tell from your screenshot, but it says torch.cpu.autocast. I don't know enough about multi-GPU setups to help more than that, though.

rockerBOO avatar Apr 01 '25 04:04 rockerBOO

I was getting a similar message but not slow training like this, and it wasn't using the CPU. I'd say it's something different, but it's hard to say.

rockerBOO avatar Apr 03 '25 20:04 rockerBOO

Could you please share the full log from before training starts? The log should include lines like the following, showing the accelerator's device:

                    INFO     preparing accelerator                                                       train_network.py:569
accelerator device: cuda

kohya-ss avatar Apr 05 '25 11:04 kohya-ss
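The check kohya-ss suggests can also be automated: scan the run's log text for the `accelerator device:` line. The helper below is a minimal sketch; the sample log string is trimmed from the excerpt above, and the function name is hypothetical:

```python
# Minimal sketch: extract the accelerator device from an sd-scripts log.
# The function name and sample text are illustrative, not part of sd-scripts.
import re
from typing import Optional

def accelerator_device(log_text: str) -> Optional[str]:
    """Return the device reported by the log, e.g. 'cuda' or 'cpu'."""
    m = re.search(r"accelerator device:\s*(\S+)", log_text)
    return m.group(1) if m else None

sample = (
    "INFO     preparing accelerator    train_network.py:569\n"
    "accelerator device: cuda\n"
)
print(accelerator_device(sample))  # -> cuda
```

If this returns `cpu` (or nothing at all), the accelerate config rather than the GPU is the first place to look.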

> Could you please share the full log from before training starts? The log should include lines like the following, showing the accelerator's device:
>
>                     INFO     preparing accelerator                                                       train_network.py:569
> accelerator device: cuda

@kohya-ss I'm getting 8 seconds per iteration on an RTX 3090. I checked the logs, and it is indeed using the cuda:0 device (my GPU) for training. I'm monitoring memory and power usage, and it's the GPU doing the work. Why is it so slow?

LoRAMilkshake avatar Jul 26 '25 07:07 LoRAMilkshake