ai-toolkit
How to train Flux Lora on multiple GPUs?
I need help training Flux Lora on multiple GPUs. The memory on a single GPU is not sufficient, so I want to train on multiple GPUs. However, configuring device: cuda:0,1 in the config file doesn't seem to work.
Could you please provide guidance on how to properly set up and run Flux Lora training across multiple GPUs? The current single-GPU memory limitation is preventing me from training effectively.
Any assistance or examples of multi-GPU configurations for Flux Lora would be greatly appreciated. Thank you!
I have the same problem. Have you solved it?
I'm currently making changes to the scripts on my end to run multi-GPU. I have quite a few requests and one GPU doesn't cut it. I know that the kohya version of Flux can run on multiple GPUs.
Also looking into this!
Same problem here.
yeah same issue, testing on one GPU is working great but can't see myself using this in the future without multi GPU
One way I see to train on multiple GPUs at once is to create several .yaml files, each with a different GPU and a different part of the dataset. This would require splitting the dataset into multiple parts and then, after training, combining the resulting .safetensors weights into a single file. I wouldn’t know how to do that merge.
However, the ideal solution would be to modify the code so that it uses multiple GPUs with a single .yaml file.
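The merge step mentioned above could, in principle, be a simple element-wise average of the corresponding tensors in each .safetensors file. Here is a minimal sketch of that idea; plain Python lists stand in for tensors, and in practice you would load each file with safetensors and average with torch. Note that averaging LoRAs trained on separate dataset shards is only an approximation and is not equivalent to training jointly on the full dataset.

```python
# Hypothetical sketch of merging LoRA weight files trained on dataset shards.
# Plain lists stand in for tensors; real .safetensors files would be loaded
# with safetensors.torch.load_file and averaged with torch instead.

def average_loras(lora_dicts):
    """Element-wise average of several LoRA state dicts with identical keys."""
    merged = {}
    for key in lora_dicts[0]:
        tensors = [d[key] for d in lora_dicts]
        merged[key] = [sum(vals) / len(tensors) for vals in zip(*tensors)]
    return merged

lora_a = {"lora_down.weight": [1.0, 2.0], "lora_up.weight": [4.0, 6.0]}
lora_b = {"lora_down.weight": [3.0, 4.0], "lora_up.weight": [0.0, 2.0]}
merged = average_loras([lora_a, lora_b])
print(merged)  # {'lora_down.weight': [2.0, 3.0], 'lora_up.weight': [2.0, 4.0]}
```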
this has already been done with some different scripts, all in all the functionality is there and accelerate can be setup for multi-gpu from the start. It's just a matter of enabling more processes, equivalent to the number of gpus and loading each one with the dataset, spread the batch size across all the gpus (this would make a batch size per device and total batch), all this needs to be done on the same machine id aka rank0
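The bookkeeping described above (one process per GPU, the dataset and the total batch size spread across them, coordinated from rank 0) can be sketched as follows. These helper names are illustrative, not ai-toolkit's actual API; in practice accelerate's launcher (e.g. `accelerate launch --num_processes 2 run.py config.yaml`) would spawn the processes and set the rank for you.

```python
# Hedged sketch of splitting a total batch size and a dataset across N
# processes (one per GPU), as an accelerate-style launcher would arrange.
# Function names are illustrative, not ai-toolkit's real interface.

def per_device_batch(total_batch, num_gpus):
    """Per-device batch size, given a total batch spread across all GPUs."""
    if total_batch % num_gpus:
        raise ValueError("total batch size must divide evenly across GPUs")
    return total_batch // num_gpus

def shard_dataset(items, rank, world_size):
    """Give each process (rank) an interleaved slice of the dataset."""
    return items[rank::world_size]

world_size = 2                                        # two GPUs, two processes
print(per_device_batch(8, world_size))                # 4 per device, 8 total
print(shard_dataset(list(range(10)), 0, world_size))  # [0, 2, 4, 6, 8]
print(shard_dataset(list(range(10)), 1, world_size))  # [1, 3, 5, 7, 9]
```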
Hello, is there a way to use multiple GPUs in the ai-toolkit config? I'm trying to train with 2× T4 GPUs on Kaggle. Thank you.
not yet
Also looking forward to the multi-gpu solutions!
Yes, please implement multi-GPU support.
I confirm that 2× T4 GPUs on Kaggle do not work. Editing the file config/examples/train_lora_flux_24gb.yaml does not help:
device: cuda:0 -> only one GPU
# device: cuda:0 -> only CPU, then an error
device: cuda -> error
device: cuda:0,1 -> error
device:
  - cuda:0
  - cuda:1
-> error
jwadow
That's not how you run multi-gpu training. Simply editing the config file won't work.
+1
+1
Isn't there any way to use multiple GPUs yet?
not yet
@WarAnakin I hope it will be possible. It would reduce training time for people like me who have access to multiple GPUs.
Error from kaggle
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 2.12 MiB is free. Process 3302 has 14.74 GiB memory in use. Of the allocated memory 14.33 GiB is allocated by PyTorch, and 287.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Hello,
You might want to reduce the batch size or the resolution.
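To see why that advice helps: activation memory grows roughly linearly with batch size and with pixel count (resolution squared), so halving the resolution or quartering the batch has a similar effect. The numbers below are illustrative only, not measurements of Flux; if a smaller batch hurts training, gradient accumulation can restore the effective batch size without the memory cost.

```python
# Rough illustration of how activation memory scales with batch size and
# resolution. Baseline and scaling are illustrative, not Flux measurements.

def relative_activation_memory(batch, res, base_batch=4, base_res=1024):
    """Activation memory relative to a (base_batch, base_res) baseline."""
    return batch / base_batch * (res / base_res) ** 2

print(relative_activation_memory(4, 1024))  # 1.0  (baseline)
print(relative_activation_memory(1, 1024))  # 0.25 (quarter the batch)
print(relative_activation_memory(4, 512))   # 0.25 (half the resolution)
```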
Sorry to ask again, is there any update on multi-gpu usage?
I don't mean to upset anyone, but if you are really in a rush to train something and you need multiple GPUs, kohya supports it.
I have made some progress by setting ('split_model_over_gpus', True) inside the model section of the config:
('model', OrderedDict([
    # huggingface model name or path
    ('name_or_path', 'black-forest-labs/FLUX.1-schnell'),
    ('assisstant_lora_path', 'ostris/FLUX.1-schnell-training-adapter'),
    ('is_flux', True),
    ('quantize', True),  # run 8bit mixed precision
    ('split_model_over_gpus', True),  # split the model over multiple GPUs, added by Alex
    # ('low_vram', True),  # uncomment if the GPU is connected to your monitors; uses less VRAM to quantize but is slower
])),
It can now properly split the SD data over two GPUs, but I quickly run into this problem:
Error in LoRAModule lora_down Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
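That error means an operation received inputs living on two different GPUs, which happens easily once the model is split across devices. The usual fix in torch is to move the activation to the layer's device first, e.g. `x = x.to(weight.device)`, before the matmul. Here is a minimal illustration with devices simulated as strings (no torch required):

```python
# Minimal illustration of the "Expected all tensors to be on the same device"
# error and its usual fix. Devices are simulated as plain strings here; with
# real torch tensors the fix is `x = x.to(weight.device)` before the matmul.

class FakeTensor:
    def __init__(self, value, device):
        self.value, self.device = value, device

    def to(self, device):
        """Return a copy of this tensor placed on the given device."""
        return FakeTensor(self.value, device)

def matmul(a, b):
    """Fail, like torch's mm, if the operands live on different devices."""
    if a.device != b.device:
        raise RuntimeError(
            "Expected all tensors to be on the same device, but found at "
            f"least two devices, {a.device} and {b.device}!"
        )
    return FakeTensor(a.value * b.value, a.device)

x = FakeTensor(2.0, "cuda:0")   # activation produced on GPU 0
w = FakeTensor(3.0, "cuda:1")   # lora_down weight placed on GPU 1
y = matmul(x.to(w.device), w)   # move the input first -> works
print(y.value, y.device)        # 6.0 cuda:1
```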
Has anyone found a solution to this?
+1
Still nothing? Why is MultiGPU ignored in so many of these projects?
Most likely because the devs can't easily write and debug this code without multiple GPUs in their own machines, and because it would mean design changes and hundreds of long test runs. Plus, most users like you and me don't have multi-GPU setups anyway.
Have you guys tried diffusion-pipe? It works on multiple GPUs by splitting the model, exactly as expected.
It doesn't work with 5090s. The dev told me it's related to a DeepSpeed issue they are failing to fix. A waste of money buying them so far!
This has been open for a year. Is there actually anyone on the project who can at least acknowledge it and add it to a todo list or something?