[Feat]: Add support for multi-GPU parallel LoRA training?
Describe your use-case.
So far I haven't found a way to do this in parallel.
What would you like to see as a solution?
I don't know how to implement this feature.
Have you considered alternatives? List them here.
No response
+1
Duplicate of #69. Pull requests adding multi-GPU support are welcome, however.
I might be interested in playing with multi-GPU training, but the costs are currently prohibitive. I'd estimate many hours of 2x A5000 rental initially, and later a limited number of hours of 4-8x A100+ for performance tests.
Feel free to delete this comment if this is considered solicitation. [I'm not interested in multi-GPU dataset preparation through MGDS, only training.]
I've looked into this a bit, and torch.distributed seems much more suitable for integration into the OT codebase than the usual accelerate or torch DDP.
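To illustrate what that could look like, here is a minimal sketch (not OneTrainer code) of training driven by raw torch.distributed primitives: one process per GPU, with gradients averaged by an explicit all_reduce after backward(). The model, optimizer, and loop are placeholder stand-ins; in practice only the LoRA parameters would need syncing.

```python
# Launch with: torchrun --nproc_per_node=N this_script.py
import os
import torch
import torch.distributed as dist

def setup():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

def average_gradients(params):
    # Sum gradients across ranks, then divide by the world size.
    world_size = dist.get_world_size()
    for p in params:
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

if __name__ == "__main__":
    local_rank = setup()
    # Placeholder model/optimizer; a real trainer would sync only the
    # trainable (LoRA) parameters.
    model = torch.nn.Linear(16, 16).cuda(local_rank)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(10):
        x = torch.randn(8, 16, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        average_gradients(model.parameters())
        optimizer.step()
    dist.destroy_process_group()
```

The appeal of this approach is that the existing single-GPU training loop stays mostly intact; the distributed pieces are process-group setup and a gradient-sync step, rather than wrapping everything in a DDP module or an accelerate context.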
https://github.com/Nerogar/OneTrainer/pull/816
A draft implementation is now available. Testers are welcome.