Slow seconds-per-iteration on FLUX dev LoRA training
Hi,
I am doing LoRA training of flux-dev on the sd3 branch, and it's taking 12-18 seconds per iteration. Increasing the num_workers count is not helping the training speed either; usually I get 3-4 s/it. Can someone help me understand this, and is there a way to improve the speed? I am using an NVIDIA A100-SXM4-80GB GPU.
@kohya-ss
Looks like it might be running on the CPU? I would check your accelerate config.
Hi @rockerBOO, this is how my config looks; it targets the GPUs:
```json
{
  "compute_environment": "LOCAL_MACHINE",
  "debug": false,
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "enable_cpu_affinity": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 8,
  "rdzv_backend": "static",
  "same_network": false,
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false
}
```
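As a quick sanity check, a minimal stdlib-only sketch that inspects the config above for the settings most relevant to speed (the field names are the ones from the JSON; nothing here is specific to kohya-ss, and the bf16 suggestion is a general Ampere-GPU rule of thumb, not a confirmed fix):

```python
import json

# The accelerate config posted above, abridged to the speed-relevant keys.
config = json.loads("""
{
  "distributed_type": "MULTI_GPU",
  "mixed_precision": "no",
  "num_processes": 8,
  "use_cpu": false
}
""")

# use_cpu must be false, otherwise accelerate pins training to the CPU.
assert config["use_cpu"] is False

# mixed_precision "no" means full fp32 everywhere; on Ampere GPUs
# (A100, RTX 3090) "bf16" is typically much faster and worth trying.
if config["mixed_precision"] == "no":
    print("consider setting mixed_precision to bf16")
```

This only inspects the config file; the `accelerator device: cuda` log line at startup remains the authoritative check of which device training actually runs on.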
Well, it's hard to tell from your screenshot, but it says torch.cpu.autocast. I don't know enough about multiple GPUs to help more than that, though.
I was getting a similar message, but not slow training like this, and it wasn't using the CPU. I'd say it's something different, but it's hard to tell.
Could you please share the full log from before training starts? The log should contain lines like the following, showing the accelerator's device:

```
INFO preparing accelerator train_network.py:569
accelerator device: cuda
```
@kohya-ss I'm getting 8 seconds per iteration on an RTX 3090. I checked the logs, and it is indeed using the cuda:0 device (my GPU) for training. I'm monitoring the memory and power usage, and it's the GPU doing the work. Why is it so slow?