
Proposal: support distributed training in Rust


As datasets and models grow larger, single-GPU training can become a limiting factor even for moderately sized tasks. I am thinking of adding a distributed training example for tch. To achieve this, two things need to be done:

  1. A distributed communication engine with Rust support: I can do this with our recently open-sourced Bagua, which has a Rust backend, bagua-core.
  2. Tensor hooks, so that we can schedule communication when, for example, a gradient becomes ready: we need to wrap VariableHooksInterface.h in torch-sys, as mentioned in https://github.com/LaurentMazare/tch-rs/issues/218. This does not seem too difficult. (A rough sketch of how the two pieces could fit together follows this list.)
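
To make the intent concrete, here is a minimal sketch of how (1) and (2) would fit together: each parameter keeps a list of hooks that fire as soon as its gradient is produced, and each hook hands the gradient to a communication backend. This is an illustration only, not existing tch-rs or bagua-core API: `CommBackend`, `MockBackend`, `DistributedParam`, `register_grad_hook`, and `grad_ready` are all hypothetical names, and in a real implementation the trigger would come from the wrapped VariableHooksInterface.h hooks rather than a hand-written `grad_ready` call.

```rust
// Hypothetical sketch only; none of these types exist in tch-rs or bagua-core.
use tch::{Device, Kind, Tensor};

/// Stand-in for a communication engine such as bagua-core (assumed interface).
trait CommBackend {
    /// Average a gradient tensor across all workers, in place.
    fn all_reduce_mean(&self, grad: &mut Tensor);
}

/// Single-process mock so the sketch runs without any real backend.
struct MockBackend;

impl CommBackend for MockBackend {
    fn all_reduce_mean(&self, _grad: &mut Tensor) {
        // A real backend would launch an (ideally asynchronous) all-reduce here.
    }
}

/// A trainable variable plus the hooks to run once its gradient is ready.
/// The real trigger would come from wrapping VariableHooksInterface.h;
/// here `grad_ready` is called by hand to simulate the backward pass.
struct DistributedParam {
    value: Tensor,
    hooks: Vec<Box<dyn FnMut(&mut Tensor)>>,
}

impl DistributedParam {
    fn new(value: Tensor) -> Self {
        Self { value, hooks: Vec::new() }
    }

    fn register_grad_hook<F: FnMut(&mut Tensor) + 'static>(&mut self, hook: F) {
        self.hooks.push(Box::new(hook));
    }

    /// Simulates "autograd produced this parameter's gradient".
    fn grad_ready(&mut self, grad: &mut Tensor) {
        for hook in &mut self.hooks {
            hook(grad);
        }
    }
}

fn main() {
    let backend = MockBackend;
    let mut param =
        DistributedParam::new(Tensor::zeros(&[4], (Kind::Float, Device::Cpu)));
    // Schedule communication as soon as this parameter's gradient exists.
    param.register_grad_hook(move |grad: &mut Tensor| backend.all_reduce_mean(grad));

    let mut fake_grad = Tensor::ones(&[4], (Kind::Float, Device::Cpu));
    param.grad_ready(&mut fake_grad);
    println!("parameter size: {:?}", param.value.size());
}
```

The reason to fire per-gradient hooks, instead of all-reducing everything after the full backward pass, is that communication for gradients produced early can overlap with the remaining backward computation.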

@LaurentMazare I would appreciate it if you have time to comment on this and check whether the direction is right. Thanks!

NOBLES5E avatar Jul 24 '21 02:07 NOBLES5E

Sounds like a great idea. I would suggest implementing this in a separate repo/crate to start with, as it will hopefully be independent from the main tch implementation; we can then link to it from the readme once it is ready so that it's easier to discover. Re (2), I'm not sure it's actually that easy. The thing I'm mostly worried about is deallocating the hook functions once the variables are no longer used; it's not very clear to me how that would work. This would only be an issue for closures and not for static functions, but I doubt that hooks would be very useful with static functions only.
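
One way the deallocation concern could play out, sketched below purely as an illustration: registration returns a handle, the boxed closure is owned by a Rust-side registry, and dropping either the handle or the registry removes and frees the closure. `HookRegistry` and `HookHandle` are hypothetical names, not tch-rs or torch-sys API, and a real implementation would also need to detach the corresponding hook on the C++ side.

```rust
// Hypothetical sketch: a Rust-side registry owns the boxed hook closures, so
// they are freed deterministically when the handle (or the registry itself,
// i.e. the variables it belongs to) is dropped.
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::{Rc, Weak};
use tch::Tensor;

type Hook = Box<dyn FnMut(&Tensor)>;

#[derive(Default)]
struct HookRegistry {
    next_id: u64,
    hooks: HashMap<u64, Hook>,
}

/// Unregisters (and therefore drops) the closure when the handle is dropped.
struct HookHandle {
    id: u64,
    registry: Weak<RefCell<HookRegistry>>,
}

impl Drop for HookHandle {
    fn drop(&mut self) {
        if let Some(reg) = self.registry.upgrade() {
            reg.borrow_mut().hooks.remove(&self.id);
        }
    }
}

fn register(reg: &Rc<RefCell<HookRegistry>>, hook: Hook) -> HookHandle {
    let mut r = reg.borrow_mut();
    let id = r.next_id;
    r.next_id += 1;
    r.hooks.insert(id, hook);
    HookHandle { id, registry: Rc::downgrade(reg) }
}

fn main() {
    let registry = Rc::new(RefCell::new(HookRegistry::default()));
    let handle = register(&registry, Box::new(|g: &Tensor| {
        let _ = g.size(); // a real hook would schedule communication here
    }));
    assert_eq!(registry.borrow().hooks.len(), 1);
    drop(handle); // the closure is removed and freed here
    assert_eq!(registry.borrow().hooks.len(), 0);
}
```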

LaurentMazare avatar Jul 25 '21 23:07 LaurentMazare

Great, I can start working on it (the distributed training part) in a separate repo to see if it works.

For (2), it seems that if we want to add hook support, it would be better for it to live in this repo?

NOBLES5E avatar Jul 29 '21 09:07 NOBLES5E

I guess this is dead?

Are there any other attempts at supporting distributed/parallel training for Rust ML?

John0x avatar Dec 01 '22 11:12 John0x

@John0x Yes, this is dead since I left my previous company, where I worked on distributed training. I would say this is a great topic to work on; I would love to see someone else get interested in it.

NOBLES5E avatar Apr 22 '23 04:04 NOBLES5E

Closing this for now as it indeed has been a while.

LaurentMazare avatar May 14 '23 11:05 LaurentMazare

Does it make sense to leave this open to track the feature even if it isn't currently planned? It would be nice to have a place to subscribe for updates.

kevincox avatar Nov 08 '23 17:11 kevincox