
How to train Flux Lora on multiple GPUs?

Open IAn2018cs opened this issue 1 year ago • 28 comments

I need help training Flux Lora on multiple GPUs. The memory on a single GPU is not sufficient, so I want to train on multiple GPUs. However, configuring device: cuda:0,1 in the config file doesn't seem to work.

Could you please provide guidance on how to properly set up and run Flux Lora training across multiple GPUs? The current single-GPU memory limitation is preventing me from training effectively.

Any assistance or examples of multi-GPU configurations for Flux Lora would be greatly appreciated. Thank you!

IAn2018cs avatar Aug 16 '24 03:08 IAn2018cs

I have the same problem. Have you solved it?

asizk avatar Aug 16 '24 08:08 asizk

i'm currently making changes to the scripts on my end to run multi-GPU. I get quite a few requests and 1 GPU doesn't cut it. I know that the kohya version of flux can run multi-GPU

WarAnakin avatar Aug 16 '24 21:08 WarAnakin

Also looking into this!

cuba6112 avatar Aug 17 '24 07:08 cuba6112

Same problem here.

Eng-ZeyadTarek avatar Aug 19 '24 21:08 Eng-ZeyadTarek

yeah, same issue. Testing on one GPU works great, but I can't see myself using this going forward without multi-GPU support

skein12 avatar Aug 20 '24 03:08 skein12

One way I see to train on multiple GPUs at once is to create several .yaml files, each with a different GPU and a different part of the dataset. This would require splitting the dataset into multiple parts and then, after training, combining the resulting .safetensors weights into a single file. I wouldn’t know how to do that merge.

However, the ideal solution would be to modify the code so that it uses multiple GPUs with a single .yaml file.

davidmartinrius avatar Aug 20 '24 09:08 davidmartinrius
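The merge step davidmartinrius mentions could, in principle, be a plain average of the corresponding tensors in each .safetensors file. A minimal stdlib-only sketch of that idea (real checkpoints would be loaded with the `safetensors` library; here flat lists of floats stand in for tensors, and whether naively averaging independently trained LoRAs produces a usable result is an open question, not something this thread confirms):

```python
def average_state_dicts(state_dicts):
    """Average corresponding entries across several state dicts.

    Each state dict maps a parameter name to a flat list of floats
    (a stand-in for a real tensor). All dicts must share the same keys.
    """
    if not state_dicts:
        raise ValueError("need at least one state dict")
    merged = {}
    for key in state_dicts[0]:
        values = [sd[key] for sd in state_dicts]
        # Element-wise mean across all checkpoints
        merged[key] = [sum(v) / len(v) for v in zip(*values)]
    return merged

# Two hypothetical LoRA checkpoints trained on different dataset halves
a = {"lora_down.weight": [0.25, 1.0]}
b = {"lora_down.weight": [0.75, 3.0]}
print(average_state_dicts([a, b]))  # {'lora_down.weight': [0.5, 2.0]}
```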

> One way I see to train on multiple GPUs at once is to create several .yaml files, each with a different GPU and a different part of the dataset. This would require splitting the dataset into multiple parts and then, after training, combining the resulting .safetensors weights into a single file. I wouldn’t know how to do that merge.
>
> However, the ideal solution would be to modify the code so that it uses multiple GPUs with a single .yaml file.

this has already been done with a few different scripts. All in all, the functionality is there, and accelerate can be set up for multi-GPU from the start. It's just a matter of launching one process per GPU, loading each with its share of the dataset, and spreading the batch size across all the GPUs (so you get a per-device batch size and a total batch size). All of this needs to be coordinated from the same machine, i.e. rank 0

WarAnakin avatar Aug 20 '24 10:08 WarAnakin
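The batch split WarAnakin describes is plain arithmetic: the total batch is divided across processes, with any remainder going to the first ranks so the per-device sizes still sum to the total. A small illustrative helper (the function name is hypothetical, not an ai-toolkit API):

```python
def per_device_batch_sizes(total_batch_size, num_gpus):
    """Split a global batch size across GPUs.

    The first `total_batch_size % num_gpus` ranks receive one extra
    sample, so the per-device sizes sum exactly to the total.
    """
    base, remainder = divmod(total_batch_size, num_gpus)
    return [base + (1 if rank < remainder else 0) for rank in range(num_gpus)]

print(per_device_batch_sizes(10, 4))  # [3, 3, 2, 2]
```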

Hello, is there a way to use multiple GPUs in the ai-toolkit config ? I'm trying to train with 2 x T4 GPU on Kaggle. thank you

Teapack1 avatar Aug 20 '24 16:08 Teapack1

> Hello, is there a way to use multiple GPUs in the ai-toolkit config ? I'm trying to train with 2 x T4 GPU on Kaggle. thank you

not yet

WarAnakin avatar Aug 20 '24 17:08 WarAnakin

Also looking forward to the multi-gpu solutions!

dydxdt avatar Aug 21 '24 08:08 dydxdt

Also looking forward to the multi-gpu solutions!

zhini-web avatar Aug 26 '24 09:08 zhini-web

Yep pls implement multi gpu use

sushmitxo avatar Sep 07 '24 13:09 sushmitxo

I confirm: 2 x T4 GPUs on Kaggle do not work. Editing the file config/examples/train_lora_flux_24gb.yaml does not help. Every variant I tried:

```yaml
device: cuda:0     # only one GPU is used
# device: cuda:0   # commented out: CPU only, then an error
device: cuda       # error
device: cuda:0,1   # error
device:            # error
  - cuda:0
  - cuda:1
```

jwadow avatar Oct 02 '24 11:10 jwadow

@jwadow

That's not how you run multi-gpu training. Simply editing the config file won't work.

WarAnakin avatar Oct 04 '24 00:10 WarAnakin
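For context on why editing `device:` can't work: multi-GPU training in comparable trainers (e.g. kohya's scripts) is normally launched through Hugging Face `accelerate`, which spawns one process per GPU, rather than through a device key in the config. ai-toolkit does not support this yet, so the commands below are only a sketch of the general pattern, and `run.py config.yaml` stands in for whatever entry point a trainer actually uses:

```shell
# One-time interactive setup: choose "multi-GPU", the number of
# processes (one per GPU), mixed precision, etc.
accelerate config

# Launch the training script under accelerate; it starts one
# process per GPU and handles gradient synchronization.
accelerate launch --num_processes 2 run.py config.yaml
```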

+1

vinch00 avatar Nov 28 '24 01:11 vinch00

+1

sherlhw avatar Dec 04 '24 03:12 sherlhw

Isn't there any way to use multiple GPUs yet?

prp-e avatar Dec 10 '24 18:12 prp-e

not yet

WarAnakin avatar Dec 10 '24 23:12 WarAnakin

@WarAnakin I hope it becomes possible. It would reduce training time for people like me who have access to multiple GPUs.

prp-e avatar Dec 11 '24 11:12 prp-e

Error from Kaggle:

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 2.12 MiB is free. Process 3302 has 14.74 GiB memory in use. Of the allocated memory 14.33 GiB is allocated by PyTorch, and 287.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

firofame avatar Dec 26 '24 01:12 firofame

> Error from kaggle
>
> torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 2.12 MiB is free. Process 3302 has 14.74 GiB memory in use. Of the allocated memory 14.33 GiB is allocated by PyTorch, and 287.48 MiB is reserved by PyTorch but unallocated.

Hello,

You might want to reduce the batch size, or resolution.

WarAnakin avatar Dec 26 '24 17:12 WarAnakin
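Besides lowering the batch size or resolution, the traceback itself suggests one mitigation for allocator fragmentation. Setting this environment variable before launching is a cheap thing to try, though it won't help if the model genuinely doesn't fit in 14.74 GiB:

```shell
# Suggested by the PyTorch OOM message: let the CUDA caching allocator
# grow segments instead of fragmenting fixed-size ones.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```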

Sorry to ask again, is there any update on multi-gpu usage?

prp-e avatar Dec 26 '24 17:12 prp-e

i don't mean to upset anyone, but if you are really in a rush to train something and you need multi-GPU, kohya supports it

WarAnakin avatar Jan 04 '25 07:01 WarAnakin

I have made some progress by setting `split_model_over_gpus` to `True` in the `model` section of the config:

```python
('model', OrderedDict([
    # huggingface model name or path
    ('name_or_path', 'black-forest-labs/FLUX.1-schnell'),
    ('assisstant_lora_path', 'ostris/FLUX.1-schnell-training-adapter'),
    ('is_flux', True),
    ('quantize', True),  # run 8bit mixed precision
    ('split_model_over_gpus', True),  # split the model over multiple gpus, added by Alex
    # ('low_vram', True),  # uncomment this if the GPU is connected to your monitors. It will use less vram to quantize, but is slower.
])),
```

It now properly splits the model over two GPUs, but I quickly run into this error:

```
Error in LoRAModule lora_down
Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
```

Has anyone found a solution to this?

chenhh17 avatar Feb 12 '25 06:02 chenhh17
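That error means a LoRA layer's weight landed on one GPU while its input activations arrived from another: when a model is split across devices, each submodule has to move its input onto its own device before the matmul. A stdlib-only sketch of that rule (in real PyTorch code the fix would be something like `x = x.to(weight.device)` inside the LoRA forward; here a device tag on plain objects stands in for a tensor):

```python
class FakeTensor:
    """Stand-in for a tensor that only tracks which device it lives on."""
    def __init__(self, device):
        self.device = device

    def to(self, device):
        # Return a copy on the target device, like torch.Tensor.to
        return FakeTensor(device) if device != self.device else self

def lora_forward(x, weight):
    # The fix for the cuda:0 / cuda:1 mismatch: move the input to the
    # device the layer's weight lives on before multiplying.
    x = x.to(weight.device)
    if x.device != weight.device:
        raise RuntimeError("Expected all tensors to be on the same device")
    return FakeTensor(weight.device)  # the result lives where the weight does

out = lora_forward(FakeTensor("cuda:0"), FakeTensor("cuda:1"))
print(out.device)  # cuda:1
```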

+1

spuro avatar Mar 28 '25 07:03 spuro

Still nothing? Why is MultiGPU ignored in so many of these projects?

Oruli avatar Apr 25 '25 05:04 Oruli

> Still nothing? Why is MultiGPU ignored in so many of these projects?

Most likely because it's hard to write and debug multi-GPU code when the devs don't have multiple GPUs in their own machines, plus the design changes and hundreds of long test runs it would require. And most users, like you and me, don't have multi-GPU setups either.

jwadow avatar Apr 27 '25 23:04 jwadow

> Still nothing? Why is MultiGPU ignored in so many of these projects?

Have you guys tried diffusion-pipe? It works on multiple GPUs by splitting the model, exactly as expected.

by1e11 avatar May 07 '25 15:05 by1e11

> Have you guys tried diffusion-pipe? It works on multiple GPUs by splitting the model, exactly as expected.

It doesn't work with 5090s. The dev told me it's related to a deepspeed issue they haven't been able to fix. A waste of money buying them so far!

Oruli avatar Jun 03 '25 18:06 Oruli

This has been open for a year, is there actually anyone on the project who can at least acknowledge it and add to a todo or something?

Oruli avatar Jul 27 '25 17:07 Oruli