
[Feat]: Add support for multi-GPU parallel LoRA training?

Open cciradih opened this issue 1 year ago • 6 comments

Describe your use-case.

So far I haven't found a way to train a LoRA across multiple GPUs in parallel.

What would you like to see as a solution?

I don't know how to implement this feature.

Have you considered alternatives? List them here.

No response

cciradih avatar Nov 24 '24 06:11 cciradih

+1

Pevernow avatar Jan 03 '25 15:01 Pevernow

Duplicate of #69. However, pull requests adding multi-GPU support are welcome.

O-J1 avatar Feb 16 '25 05:02 O-J1

I might be interested in playing with multi-GPU training, but the costs are currently prohibitive. I'd estimate many hours of 2x A5000 rental initially, and later a limited number of hours on 4-8x A100+ machines for performance tests.

Feel free to delete this comment if this is considered solicitation. [I'm not interested in multi-GPU dataset preparation through MGDS, only training.]

dxqb avatar Feb 16 '25 06:02 dxqb

I've looked into this a bit, and torch.distributed seems much better suited for integration into the OT codebase than the usual Accelerate or torch DDP approaches.

dxqb avatar Apr 18 '25 19:04 dxqb
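[Editor's note: for readers unfamiliar with the distinction, the sketch below illustrates the kind of lower-level torch.distributed usage being referred to: manual gradient averaging across GPUs instead of wrapping the model in DistributedDataParallel. It is not the implementation from the PR linked below; names such as `lora_parameters` are placeholders, not OneTrainer APIs.]

```python
# Minimal sketch of manual multi-GPU gradient sync with torch.distributed.
# Assumes the process group is launched via torchrun, which sets RANK,
# LOCAL_RANK and WORLD_SIZE for each spawned process.
import os
import torch
import torch.distributed as dist


def setup_distributed() -> int:
    # One process per GPU; NCCL is the usual backend for CUDA devices.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank


def all_reduce_gradients(lora_parameters, world_size: int) -> None:
    # Called after backward() and before optimizer.step():
    # sum each gradient across ranks, then divide to get the average.
    for param in lora_parameters:
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```

Such a script would typically be launched with something like `torchrun --nproc_per_node=2 train.py` (script name hypothetical). The trade-off versus DDP is that DDP wraps the model and overlaps gradient synchronization with the backward pass automatically, while direct torch.distributed calls leave the synchronization points under the trainer's explicit control.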

https://github.com/Nerogar/OneTrainer/pull/816

dxqb avatar Apr 24 '25 19:04 dxqb

A draft implementation is now available. Testers are welcome.

dxqb avatar Apr 24 '25 19:04 dxqb