Multi-GPU support for OneTrainer
This is a draft, but it is intended to be feature-complete and work with all models, optimizations and other parameters OT has to offer.
Some basic tests have been done to verify that multi-GPU training learns as well as single-GPU training, by comparing:
- validation loss of training on 4 GPUs
- with training on 1 GPU, but with 4x the batch size
- and with training on 1 GPU, but 4x the gradient accumulation
More testing is necessary though.
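These three setups are comparable because each optimizer step sees the same effective batch size. A minimal sketch of that relationship (the helper name is just for illustration, not OneTrainer code):

```python
def effective_batch_size(local_batch_size: int, num_gpus: int, grad_accum_steps: int) -> int:
    # Number of samples that contribute to each optimizer step across all GPUs.
    return local_batch_size * num_gpus * grad_accum_steps

assert effective_batch_size(1, 4, 1) == 4  # 4 GPUs, local batch size 1
assert effective_batch_size(4, 1, 1) == 4  # 1 GPU, 4x the batch size
assert effective_batch_size(1, 1, 4) == 4  # 1 GPU, 4x gradient accumulation
```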
Performance impact for LoRA training is negligible, meaning that 4 GPUs train as fast as 1 GPU but with 4x the batch size:
For full fine-tuning of a model as large as Flux, there is some performance impact. A minimum local batch size of 4 to 8 is required for it to make sense:
Or better hardware: this test was done on A5000s, which are older PCIe cards similar to 3090s, connected through PIX/PXB. If anyone has access to high-end hardware with NVLink connections, it would be interesting to see the remaining performance impact on full fine-tuning.
Multi-GPU training should also work on Windows (though less optimized, because NVIDIA NCCL is not available for Windows), but I have no way to test that. If anyone has a Windows machine with 2 GPUs at home, please confirm.
This PR includes https://github.com/Nerogar/OneTrainer/pull/803 and https://github.com/Nerogar/OneTrainer/pull/804
Limitations & bugs found by testers so far:
Limitation:
- Caching is done on only 1 GPU. I am not really interested in implementing multi-GPU caching. This can be contributed separately if someone wants to do it.
Bugs:
- [X] if caching takes more than 10 minutes, the other GPU processes error out because NCCL has a timeout on how long GPUs wait for each other - fixed
- [X] batch size is the local batch size; made clear in the UI - done
- [X] Latent caching must be enabled currently (https://github.com/Nerogar/mgds/pull/23) - both work now
- [x] gradient reduction is done in train dtype precision, which is not ideal for low-precision dtypes and differs from other implementations such as Accelerate - implemented as an option, fp32 is the default (see the sketch below)
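As a rough illustration of that option (plain torch.distributed, not OneTrainer's actual code): gradients are upcast to fp32 before the all-reduce and cast back afterwards.

```python
import torch
import torch.distributed as dist

def all_reduce_grads_fp32(model: torch.nn.Module, world_size: int) -> None:
    # Average gradients across ranks, upcasting to fp32 for the reduction so that
    # summing across many GPUs doesn't lose precision for bf16/fp16 train dtypes.
    for param in model.parameters():
        if param.grad is None:
            continue
        grad_fp32 = param.grad.to(torch.float32)
        dist.all_reduce(grad_fp32, op=dist.ReduceOp.SUM)
        param.grad.copy_(grad_fp32 / world_size)  # copy casts back to the grad's dtype
```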
Additional features that could be useful:
- Local SGD - quite experimental, unknown if it's even useful for fine-tuning
- Elastic averaging SGD - takes too much VRAM
Validation loss of single-GPU training (batch size 4) vs. 4 GPUs (local batch size 1, global batch size 4) with various settings:
All known issues above are resolved now.
Awesome, big work there!
So it works out of the box? How do I set it up in the interface?
All necessary settings have tooltip explanations:
@dxqbYD thank you so much. Also, by any chance, are you Turkish? :D
Ran on a local Ubuntu (24.04) machine with 2x 3090s. Got 4.89 s/it with a rank 32 LoRA @ 1024px with SDXL. Worked fine and the resulting LoRA had no issues during inference. Have not attempted fine-tuning.
Does this method split a model (like HiDream)? For example, instead of 24 GB, could it theoretically work on 12 GB + 12 GB (with quantization and resolution reduction)? Or is it just sequential parallelism?
It does not shard the model; it's only data parallel.
While sharding is interesting, I do not see the use case for our models. Sharding is basically an offloading technique: the model is split, but before each layer executes, all of its parameters must be re-gathered on all GPUs.
This requires high bandwidth between GPUs; otherwise @Nerogar's very efficient async RAM offloading implementation is just better. If you have very high bandwidth between GPUs, such as NVLink, you are probably not limited on VRAM anyway.
You can use RAM offloading to run a 24 GB model on 12 GB VRAM, on a single GPU and on multi-GPU.
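For context, data parallelism keeps a complete model replica on every GPU and only exchanges gradients; this is roughly what PyTorch's DistributedDataParallel does. A minimal sketch (illustrative only, not how OneTrainer wires this up internally):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_data_parallel(model: torch.nn.Module, rank: int, world_size: int) -> DDP:
    # Assumes MASTER_ADDR/MASTER_PORT are already set, e.g. by torchrun.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # Every rank holds a full replica of the model and trains on its own slice
    # of the data; only gradients are exchanged (all-reduced) during backward().
    return DDP(model.to(rank), device_ids=[rank])
```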
> This requires high bandwidth between GPUs; otherwise @Nerogar's very efficient async RAM offloading implementation is just better. If you have very high bandwidth between GPUs, such as NVLink, you are probably not limited on VRAM anyway.
Can OneTrainer offload some parts of a model into RAM for training, or did you mean that it works for inference? (I don't know if OneTrainer can do inference.) If it can offload for training, I suppose there is a limit? For example, for HiDream, is it possible to train the model using 12 GB VRAM and 64 GB RAM (with nf8 or bf16) or not?
Yes. Please join our Discord #help channel for these questions. There is no practical limit if you have enough RAM.
Thanks for the excellent multi-GPU training feature—it works perfectly!
I was wondering if you could also enable multi-GPU support for latent caching? The current single-GPU process is very slow and becomes a major bottleneck with large datasets.
Thanks for testing!
Please see above:
> Caching is done on only 1 GPU. I am not really interested in implementing multi-GPU caching. This can be contributed separately if someone wants to do it.
Would you be able to implement this? If so, please join our Discord server and I can point you in some (code) directions.
Tests on Windows:
- [x] backend is not autodetected - hardcode to 'gloo' on Windows (see the sketch after this list)
- [x] disable call to 'nvidia-smi topo' - not available on Windows
- [x] device mismatch if master rank is not GPU #0 during caching
- [x] https://github.com/Nerogar/mgds/pull/23#discussion_r2219310517
- [ ] Chroma
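Roughly what the backend choice in the first item above boils down to (a sketch only; the helper name is hypothetical):

```python
import platform
import torch.distributed as dist

def pick_backend() -> str:
    # NCCL is Linux-only; on Windows the Gloo backend is the fallback.
    if platform.system() == "Windows":
        return "gloo"
    return "nccl" if dist.is_nccl_available() else "gloo"
```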
Seems to be broken when Local Batch Size > 1. It starts the setup and then, when it's supposed to start training, it shows `epoch: 0it [00:00, ?it/s]` and the OneTrainer GUI shows stopped.
This is on Windows 10 with 2x RTX 3090, Torch 2.7.1+cu128, Python 3.12.
Wasn't broken with a BS of 8 at the time I made the prior comment. Please also pull that commit: https://github.com/Nerogar/OneTrainer/pull/816#issuecomment-3064913121
Additionally, you need to copy-paste the console error and upload your config.json for us to do anything about this. Make sure to Ctrl+F replace your username first before uploading it.
Thanks for the speedy reply.
I don't see which additional commit you are referring to in the linked comment?
I merged this PR into the OneTrainer master, then used the Chroma1 preset to try training a LoRA. All default settings except for enabling Multi-GPU and Local Batch Size of 2. Used Chroma1-Base model.
There are no errors in the console though, that's the thing. All I see is `epoch: 0it [00:00, ?it/s]`, after which the GUI shows that training has stopped. I can click Start Training again, but it will lead to the same outcome.
Sorry, I misspoke; I meant the commit from the time I commented, which is: 8208f23
Also, I can't hang around, but this sounds like user error; OT uses aspect ratio bucketing. If you don't have enough images per aspect ratio to fill a bucket (which is equal to the batch size), the images will be dropped, leading to no training or instant training.
https://github.com/Nerogar/OneTrainer/wiki/Aspect-Ratio-Bucketing
Please give the entire tab explanation section a full read
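As a rough, hypothetical example of the arithmetic (variable names and numbers are illustrative, and assume batches are built on the global batch size, as described later in the thread):

```python
# With multi-GPU, batches are built on the global batch size
# (local batch size x number of GPUs) before being distributed,
# so a small bucket can end up contributing no batches at all.
images_in_bucket = 3      # e.g. a 3-image, single-resolution dataset
local_batch_size = 2
num_gpus = 2
global_batch_size = local_batch_size * num_gpus        # 4

full_batches = images_in_bucket // global_batch_size   # 0 -> nothing to train on
dropped_images = images_in_bucket % global_batch_size  # 3 images dropped
```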
You were right, it was a user error - I starved the trainer!
I used my old 3-image set (1024x1024) from when I was training a Flux LoRA with fluxgym and sd-scripts. It worked there with a BS of 2, but it also had 10 repeats configured on the dataset, making it 30 images. Adding another 4 new images to the dataset makes a batch size of 2 work now.
Sadly, I'm still not getting even a close likeness of the dataset images in the Chroma LoRA though. In Flux I'd get a decent result with just 3 images in the dataset and a handful of epochs, but Chroma is a different kind of beast, I guess.
Thanks!
- [x] batches are built first, and then distributed - meaning the same aspect ratio on all GPUs. Is that good/necessary?
Hi,
I'm probably one of the rare Windows users with 2 GPUs who can run some tests. The first tests managed to solve https://github.com/Nerogar/OneTrainer/pull/816#issuecomment-3195656013 ; that was because some nvidia-smi commands are not available on Windows.
To all who tested this PR: what do you think is worth testing in a Windows context? My current tests are very simple, with Flux LoRA training, activating multi-GPU and defining the GPUs to be used as 0,1 or 1,0. I didn't test the other multi-GPU related settings in the general tab.
I don't think other training settings affect this PR much, but there are always exceptions. The first known one is with samples: if exposed to TensorBoard, only one GPU is used for sampling; if not exposed, both GPUs can be used (not really tested, as I set only one sample).
I don't use validation or PP datasets, but I can if needed.
So my request is just for a test plan, so that this PR, which sounds like it's working perfectly, can be released. So far the main difference for multi-GPU between Linux and Windows is the nvidia-smi commands, as far as I know, but there could be others. The request goes to all testers and contributors; we all want this PR to be merged into the master branch.
Reminder: I like to define myself as a simple user, so not a dev nor very technical, but I can help with testing given a test plan.
Thanks!
> ... what to test ...
Just a fine-tune test if possible, everything else should be OK.
Sharing my tests on Windows: SDXL fine-tune, default settings, 163 images all cropped to the training resolutions (1024x1024 and 832x1216), local batch size 4, accumulation steps 1 then 2, 2x NVIDIA RTX 5090, training devices: 1,0.

| Train dtype | Accumulation steps | Training step | VRAM used on devices (1,0) |
| --- | --- | --- | --- |
| FLOAT_32_STOCHASTIC | 1 | 36 s/it | 22-24 GB |
| FLOAT_32_STOCHASTIC | 2 | 15 s/it | 25-27 GB |
| FLOAT_32 | 1 | 30 s/it | 22-24 GB |
| FLOAT_32 | 2 | 15 s/it | 25-27 GB |
| WEIGHT_DTYPE_STOCHASTIC | 1 | 15 s/it | 22-24 GB |
| WEIGHT_DTYPE_STOCHASTIC | 2 | 7 s/it | 25-27 GB |
| WEIGHT_DTYPE | 1 | 15 s/it | 22-24 GB |
| WEIGHT_DTYPE | 2 | 7 s/it | 25-27 GB |
Note that:
- for SDXL LoRA, training steps are around 1.2 s/it with FLOAT_32_STOCHASTIC, AS 1, and 10-12 GB VRAM used on devices 1,0.
- for SDXL embedding, training steps are around 1.5 s/it with FLOAT_32_STOCHASTIC, AS 1, and 10-12 GB VRAM used on devices 1,0. The slow fine-tuning speed seems to be a Windows issue according to dxqb, not a bug in the PR, and it only affects fine-tuning; LoRA training works perfectly (tests made on SDXL and Flux).
That's just to note down these results for the wiki. If someone on Linux could run the same tests with similar cards, the wiki would be grateful; I won't switch to Linux just to get a benchmark ;)
One remark: after testing SDXL fine-tune with all settings, then SDXL LoRA and finally SDXL embedding in the same instance, OT got stuck when saving the model. The embedding got saved, but neither the UI nor the console said "training stopped". I closed the instance and started a new one, but could not reproduce the issue; single- and multi-GPU SDXL embedding were all OK. Just that it's 1.2 s/it on a single GPU but 15 GB VRAM, and 1.5 s/it with 2 GPUs but 10-12 GB VRAM. Same dataset and BS, default settings always.
Your performance seems to scale with the amount of data that has to be transferred between GPUs. Doubling AS halves the amount of data, and switching to WEIGHT_DTYPE halves it another time.
Gloo (the backend on Windows) is known to be much less efficient than NCCL (which is only available on Linux). However, I just read on a forum that Gloo is blocking and doesn't work asynchronously. If that's the case, it could help to disable "Fused Gradient Reduce":
NCCL is more efficient with it (the data is split up into small parts and transferred asynchronously during the backward pass). Gloo might have an issue with many small transfers.
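To make "transferred asynchronously during the backward pass" concrete, here is a rough sketch of overlapping gradient all-reduces with backward using per-parameter hooks (hypothetical plain PyTorch, not OneTrainer's actual fused implementation; if Gloo really is blocking, these transfers would not overlap):

```python
import torch
import torch.distributed as dist

def attach_async_grad_reduce(model: torch.nn.Module, world_size: int):
    # Kick off an all-reduce for each parameter as soon as its gradient is ready,
    # so communication overlaps with the rest of the backward pass.
    # (Assumes a single backward pass per optimizer step.)
    pending = []

    def hook(param: torch.Tensor) -> None:
        work = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
        pending.append((work, param))

    for param in model.parameters():
        if param.requires_grad:
            param.register_post_accumulate_grad_hook(hook)

    def finish() -> None:
        # Call after loss.backward() and before optimizer.step().
        for work, param in pending:
            work.wait()
            param.grad.div_(world_size)
        pending.clear()

    return finish
```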
> batches are built first, and then distributed - meaning the same aspect ratio on all GPUs. Is that good/necessary?

Changing this would reduce sample dropping, but it isn't good for multi-resolution training: all GPUs would have to wait for the slowest GPU. For example, if you have some 1024px images mixed into mainly 512px training, training would be much slower.
Aspect batch sorting is therefore done on the global batch size.
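A minimal sketch of that distribution scheme (illustrative only; the function name and slicing are assumptions, not the actual mgds/OneTrainer code):

```python
def local_slice(global_batch: list, world_size: int, rank: int) -> list:
    # The global batch comes from a single aspect-ratio bucket, so every rank
    # gets samples of the same resolution and no rank has to wait on a slower,
    # higher-resolution batch running elsewhere.
    local_size = len(global_batch) // world_size
    return global_batch[rank * local_size : (rank + 1) * local_size]

# e.g. local_slice(global_batch_of_4, world_size=4, rank=2) -> 1 sample for GPU 2
```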
Merge with:
- [ ] Chroma
- [ ] Qwen