
[pytorch] How to do distributed training with JavaCPP PyTorch

Open mullerhai opened this issue 2 years ago • 6 comments

Hi. Large models now need to be trained across many distributed machines. In the Python version of PyTorch we can use the distributed package to declare that a model is trained on multiple machines, but in JavaCPP PyTorch we cannot find any distributed methods. How can we do this with JavaCPP?

mullerhai avatar Jul 29 '23 05:07 mullerhai

There are now also good distributed training tools, such as BMTrain (https://github.com/OpenBMB/BMTrain). If we could compile it for Java, maybe we could try it.

mullerhai avatar Jul 29 '23 05:07 mullerhai

We also want to add Spark GPU training with JavaCPP PyTorch.

mullerhai avatar Jul 29 '23 05:07 mullerhai

Any progress on this? Any plan to support torch.distributed? Thanks!

haifengl avatar May 21 '24 20:05 haifengl

That's ongoing, but I know little about this API. I will post a PR when I get a first version compiling, but I'll need you to test it and check that everything needed has been mapped.

HGuillemet avatar May 23 '24 14:05 HGuillemet

Sounds good. Will do.

haifengl avatar May 23 '24 15:05 haifengl

Could we limit this to the Gloo backend for pytorch and to NCCL for pytorch-gpu? That is, no support for MPI and UCC?

HGuillemet avatar May 25 '24 07:05 HGuillemet