Multi-GPU support for OneTrainer
This is a draft, but it is intended to be feature-complete and work with all models, optimizations and other parameters OT has to offer.
Some basic tests have been done to verify that multi-GPU training learns as well as single-GPU training, by comparing:
- validation loss of training on 4 GPUs
- with training on 1 GPU, but with 4x the batch size
- and with training on 1 GPU, but 4x the gradient accumulation
More testing is necessary though.
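These three setups are comparable because each optimizer step sees the same effective batch size. A minimal sketch of that relationship (the helper name is just for illustration, not OneTrainer code):

```python
def effective_batch_size(local_batch_size: int, num_gpus: int, grad_accum_steps: int) -> int:
    # Number of samples that contribute to each optimizer step across all GPUs.
    return local_batch_size * num_gpus * grad_accum_steps

assert effective_batch_size(1, 4, 1) == 4  # 4 GPUs, local batch size 1
assert effective_batch_size(4, 1, 1) == 4  # 1 GPU, 4x the batch size
assert effective_batch_size(1, 1, 4) == 4  # 1 GPU, 4x gradient accumulation
```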
Performance impact for LoRA training is negligible, meaning that 4 GPUs train as fast as 1 GPU but with 4x the batch size:
For full fine-tuning of a model as large as Flux, there is some performance impact. A minimum local batch size of 4 to 8 is required for it to make sense:
Or better hardware: this test was done on A5000s, which are older PCIe cards similar to 3090s, connected through PIX/PXB. If anyone has access to high-end hardware with NVLink connections, it would be interesting to see the remaining performance impact on full fine-tuning.
Multi-GPU training should also work on Windows (though less optimized, because NVIDIA NCCL is not available for Windows), but I have no way to test that. If anyone has a Windows machine with 2 GPUs at home, please confirm.
This PR includes https://github.com/Nerogar/OneTrainer/pull/803 and https://github.com/Nerogar/OneTrainer/pull/804
Limitations & bugs found by testers so far:
Limitation:
- Caching is done on only 1 GPU. I am not really interested in implementing multi-GPU caching. This can be contributed separately if someone wants to do it.
Bugs:
- [X] if caching takes more than 10 minutes, the other GPU processes error out because NCCL has a timeout on how long GPUs wait for each other - fixed
- [X] batch size is the local batch size; made clear in the UI - done
- [X] Latent caching must be enabled currently (https://github.com/Nerogar/mgds/pull/23) - both work now
- [x] gradient reduction is done in train dtype precision, which is not ideal for low-precision dtypes and differs from other implementations such as Accelerate - implemented as an option, fp32 is the default (see the sketch below)
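As a rough illustration of that option (plain torch.distributed, not OneTrainer's actual code): gradients are upcast to fp32 before the all-reduce and cast back afterwards.

```python
import torch
import torch.distributed as dist

def all_reduce_grads_fp32(model: torch.nn.Module, world_size: int) -> None:
    # Average gradients across ranks, upcasting to fp32 for the reduction so that
    # summing across many GPUs doesn't lose precision for bf16/fp16 train dtypes.
    for param in model.parameters():
        if param.grad is None:
            continue
        grad_fp32 = param.grad.to(torch.float32)
        dist.all_reduce(grad_fp32, op=dist.ReduceOp.SUM)
        param.grad.copy_(grad_fp32 / world_size)  # copy casts back to the grad's dtype
```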
Additional features that could be useful:
- Local SGD - quite experimental, unknown if it's even useful for fine-tuning
- Elastic averaging SGD - takes too much VRAM
Validation loss of single-GPU training (batch size 4) vs. 4 GPUs (local batch size 1, global batch size 4) with various settings:
All known issues above are resolved now.
Awesome, big work there!
So it works out of the box? How do I set it up in the interface?
All necessary settings have tooltip explanations:
@dxqbYD thank you so much. Also, by any chance, are you Turkish? :D
Ran on a local Ubuntu (24.04) machine with 2x 3090s. Got 4.89 s/it with a rank 32 LoRA @ 1024px with SDXL. Worked fine and the resulting LoRA had no issues during inference. Have not attempted fine-tuning.
Does this method split a model (like HiDream)? For example, instead of 24 GB, could it theoretically work on 12 GB + 12 GB (with quantization and resolution reduction)? Or is it just sequential parallelism?
It does not shard the model; it's only data parallel.
While sharding is interesting, I do not see the use case for our models. Sharding is basically an offloading technique: the model is split, but before each layer executes, all of its parameters must be re-gathered on all GPUs.
This requires high bandwidth between GPUs; otherwise @Nerogar's very efficient async RAM offloading implementation is just better. If you have very high bandwidth between GPUs, such as NVLink, you are probably not limited on VRAM anyway.
You can use RAM offloading to run a 24 GB model on 12 GB VRAM, on a single GPU and on multi-GPU.
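For context, data parallelism keeps a complete model replica on every GPU and only exchanges gradients; this is roughly what PyTorch's DistributedDataParallel does. A minimal sketch (illustrative only, not how OneTrainer wires this up internally):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_data_parallel(model: torch.nn.Module, rank: int, world_size: int) -> DDP:
    # Assumes MASTER_ADDR/MASTER_PORT are already set, e.g. by torchrun.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # Every rank holds a full replica of the model and trains on its own slice
    # of the data; only gradients are exchanged (all-reduced) during backward().
    return DDP(model.to(rank), device_ids=[rank])
```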
> This requires high bandwidth between GPUs; otherwise @Nerogar's very efficient async RAM offloading implementation is just better. If you have very high bandwidth between GPUs, such as NVLink, you are probably not limited on VRAM anyway.
Can OneTrainer offload some parts of a model into RAM for training, or did you mean that it works for inference? (I don't know if OneTrainer can do inference.) If it can offload for training, I suppose there is a limit? For example, for HiDream, is it possible to train the model using 12 GB VRAM and 64 GB RAM (with nf8 or bf16) or not?
Yes. Please join our Discord #help channel for these questions. There is no practical limit if you have enough RAM.
Thanks for the excellent multi-GPU training feature—it works perfectly!
I was wondering if you could also enable multi-GPU support for latent caching? The current single-GPU process is very slow and becomes a major bottleneck with large datasets.
Thanks for testing!
Please see above:
> Caching is done on only 1 GPU. I am not really interested in implementing multi-GPU caching. This can be contributed separately if someone wants to do it.
Would you be able to implement this? If so, please join our Discord server and I can point you in some (code) directions.
Tests on Windows:
- [x] backend is not autodetected - hardcode to 'gloo' on Windows (see the sketch after this list)
- [x] disable call to 'nvidia-smi topo' - not available on Windows
- [x] device mismatch if master rank is not GPU #0 during caching
- [x] https://github.com/Nerogar/mgds/pull/23#discussion_r2219310517
- [ ] Chroma
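Roughly what the backend choice in the first item above boils down to (a sketch only; the helper name is hypothetical):

```python
import platform
import torch.distributed as dist

def pick_backend() -> str:
    # NCCL is Linux-only; on Windows the Gloo backend is the fallback.
    if platform.system() == "Windows":
        return "gloo"
    return "nccl" if dist.is_nccl_available() else "gloo"
```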
Seems to be broken when Local Batch Size > 1. It starts the setup and then, when it's supposed to start training, it shows `epoch: 0it [00:00, ?it/s]` and the OneTrainer GUI shows stopped.
This is on Windows 10 with 2x RTX 3090, Torch 2.7.1+cu128, Python 3.12.
Wasn't broken with a BS of 8 at the time I made the prior comment. Please also pull that commit: https://github.com/Nerogar/OneTrainer/pull/816#issuecomment-3064913121
Additionally, you need to copy-paste the console error and upload your config.json for us to do anything about this. Make sure to Ctrl+F replace your username first before uploading it.
Thanks for the speedy reply.
I don't see which additional commit you are referring to in the linked comment?
I merged this PR into the OneTrainer master, then used the Chroma1 preset to try training a LoRA. All default settings except for enabling Multi-GPU and Local Batch Size of 2. Used Chroma1-Base model.
There are no errors in the console though, that's the thing. All I see is `epoch: 0it [00:00, ?it/s]`, after which the GUI shows that training has stopped. I can click Start Training again, but it will lead to the same outcome.
Sorry, I misspoke; I meant the commit from the time I commented, which is: 8208f23
Also, I can't hang around, but this sounds like user error; OT uses aspect ratio bucketing. If you don't have enough images per aspect ratio to fill a bucket (which is equal to the batch size), the images will be dropped, leading to no training or instant training.
https://github.com/Nerogar/OneTrainer/wiki/Aspect-Ratio-Bucketing
Please give the entire tab explanation section a full read
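As a rough, hypothetical example of the arithmetic (variable names and numbers are illustrative, and assume batches are built on the global batch size, as described later in the thread):

```python
# With multi-GPU, batches are built on the global batch size
# (local batch size x number of GPUs) before being distributed,
# so a small bucket can end up contributing no batches at all.
images_in_bucket = 3      # e.g. a 3-image, single-resolution dataset
local_batch_size = 2
num_gpus = 2
global_batch_size = local_batch_size * num_gpus        # 4

full_batches = images_in_bucket // global_batch_size   # 0 -> nothing to train on
dropped_images = images_in_bucket % global_batch_size  # 3 images dropped
```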
You were right, it was a user error - I starved the trainer!
I used my old 3-image set (1024x1024) from when I was training a Flux LoRA with fluxgym and sd-scripts. It worked there with a BS of 2, but it also had 10 repeats configured on the dataset, making it 30 images. Adding another 4 new images to the dataset makes a batch size of 2 work now.
Sadly, I'm still not getting even a close likeness of the dataset images in the Chroma LoRA though. In Flux I'd get a decent result with just 3 images in the dataset and a handful of epochs, but Chroma is a different kind of beast, I guess.
Thanks!
- [x] batches are built first, and then distributed - meaning the same aspect ratio on all GPUs. Is that good/necessary?
Hi,
I'm probably one of the rare Windows users with 2 GPUs who can run some tests. The first tests managed to solve https://github.com/Nerogar/OneTrainer/pull/816#issuecomment-3195656013 ; that was because some nvidia-smi commands are not available on Windows.
To all who tested this PR: what do you think is worth testing in a Windows context? My current tests are very simple, with Flux LoRA training, activating multi-GPU and defining the GPUs to be used as 0,1 or 1,0. I didn't test the other multi-GPU related settings in the general tab.
I don't think other training settings affect this PR much, but there are always exceptions. The first known one is with samples: if exposed to TensorBoard, only one GPU is used for sampling; if not exposed, both GPUs can be used (not really tested, as I set only one sample).
I don't use validation or PP datasets, but I can if needed.
So my request is just for a test plan, so that this PR, which sounds like it's working perfectly, can be released. So far the main difference for multi-GPU between Linux and Windows is the nvidia-smi commands, as far as I know, but there could be others. The request goes to all testers and contributors; we all want this PR to be merged into the master branch.
Reminder: I like to define myself as a simple user, so not a dev nor very technical, but I can help with testing given a test plan.
Thanks!
> ... what to test ...
Just a fine-tune test if possible, everything else should be OK.
Sharing my tests on Windows: SDXL fine-tune, default settings, 163 images all cropped to the training resolutions (1024x1024 and 832x1216), local batch size 4, accumulation steps 1 then 2, 2x NVIDIA RTX 5090, training devices: 1,0.

| Train dtype | Accumulation steps | Training step | VRAM used on devices (1,0) |
| --- | --- | --- | --- |
| FLOAT_32_STOCHASTIC | 1 | 36 s/it | 22-24 GB |
| FLOAT_32_STOCHASTIC | 2 | 15 s/it | 25-27 GB |
| FLOAT_32 | 1 | 30 s/it | 22-24 GB |
| FLOAT_32 | 2 | 15 s/it | 25-27 GB |
| WEIGHT_DTYPE_STOCHASTIC | 1 | 15 s/it | 22-24 GB |
| WEIGHT_DTYPE_STOCHASTIC | 2 | 7 s/it | 25-27 GB |
| WEIGHT_DTYPE | 1 | 15 s/it | 22-24 GB |
| WEIGHT_DTYPE | 2 | 7 s/it | 25-27 GB |
Note that:
- for SDXL LoRA, training steps are around 1.2 s/it with FLOAT_32_STOCHASTIC, AS 1, and 10-12 GB VRAM used on devices 1,0.
- for SDXL embedding, training steps are around 1.5 s/it with FLOAT_32_STOCHASTIC, AS 1, and 10-12 GB VRAM used on devices 1,0. The slow fine-tuning speed seems to be a Windows issue according to dxqb, not a bug in the PR, and it only affects fine-tuning; LoRA training works perfectly (tests made on SDXL and Flux).
That's just to note down these results for the wiki. If someone on Linux could run the same tests with similar cards, the wiki would be grateful; I won't switch to Linux just to get a benchmark ;)
One remark: after testing SDXL fine-tune with all settings, then SDXL LoRA and finally SDXL embedding in the same instance, OT got stuck when saving the model. The embedding got saved, but neither the UI nor the console said "training stopped". I closed the instance and started a new one, but could not reproduce the issue; single- and multi-GPU SDXL embedding were all OK. Just that it's 1.2 s/it on a single GPU but 15 GB VRAM, and 1.5 s/it with 2 GPUs but 10-12 GB VRAM. Same dataset and BS, default settings always.
Your performance seems to scale with the amount of data that has to be transferred between GPUs. Doubling AS halves the amount of data, and switching to WEIGHT_DTYPE halves it another time.
Gloo (the backend on Windows) is known to be much less efficient than NCCL (which is only available on Linux). However, I just read on a forum that Gloo is blocking and doesn't work asynchronously. If that's the case, it could help to disable "Fused Gradient Reduce":
NCCL is more efficient with it (the data is split up into small parts and transferred asynchronously during the backward pass). Gloo might have an issue with many small transfers.
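To make "transferred asynchronously during the backward pass" concrete, here is a rough sketch of overlapping gradient all-reduces with backward using per-parameter hooks (hypothetical plain PyTorch, not OneTrainer's actual fused implementation; if Gloo really is blocking, these transfers would not overlap):

```python
import torch
import torch.distributed as dist

def attach_async_grad_reduce(model: torch.nn.Module, world_size: int):
    # Kick off an all-reduce for each parameter as soon as its gradient is ready,
    # so communication overlaps with the rest of the backward pass.
    # (Assumes a single backward pass per optimizer step.)
    pending = []

    def hook(param: torch.Tensor) -> None:
        work = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
        pending.append((work, param))

    for param in model.parameters():
        if param.requires_grad:
            param.register_post_accumulate_grad_hook(hook)

    def finish() -> None:
        # Call after loss.backward() and before optimizer.step().
        for work, param in pending:
            work.wait()
            param.grad.div_(world_size)
        pending.clear()

    return finish
```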
> batches are built first, and then distributed - meaning the same aspect ratio on all GPUs. Is that good/necessary?

Changing this would reduce sample dropping, but it isn't good for multi-resolution training: all GPUs would have to wait for the slowest GPU. For example, if you have some 1024px images mixed into mainly 512px training, training would be much slower.
Aspect batch sorting is therefore done on the global batch size.
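A minimal sketch of that distribution scheme (illustrative only; the function name and slicing are assumptions, not the actual mgds/OneTrainer code):

```python
def local_slice(global_batch: list, world_size: int, rank: int) -> list:
    # The global batch comes from a single aspect-ratio bucket, so every rank
    # gets samples of the same resolution and no rank has to wait on a slower,
    # higher-resolution batch running elsewhere.
    local_size = len(global_batch) // world_size
    return global_batch[rank * local_size : (rank + 1) * local_size]

# e.g. local_slice(global_batch_of_4, world_size=4, rank=2) -> 1 sample for GPU 2
```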
Merge with:
- [ ] Chroma
- [ ] Qwen