javacpp-presets [pytorch] How to do distributed training with JavaCPP PyTorch

Open mullerhai opened this issue 2 years ago • 6 comments

Hi. To train big models, we need to distribute training across many machines. In the Python version we can use the distributed API to declare that a model trains on many machines, but in JavaCPP PyTorch we cannot find any distributed method. How can we do this in JavaCPP?

Jul 29 '23 05:07 mullerhai

There are also good distributed training tools now, such as BMTrain (https://github.com/OpenBMB/BMTrain). If it could be compiled for Java, maybe we could try it.

Jul 29 '23 05:07 mullerhai

We would also like to add Spark GPU training with JavaCPP PyTorch.

Jul 29 '23 05:07 mullerhai

Any progress on this? Any plan to support torch.distributed? Thanks!

May 21 '24 20:05 haifengl

That's ongoing, but I know little about this API. I will post a PR when I get a first version compiling, but I'll need you to test it and check that everything needed has been mapped.

May 23 '24 14:05 HGuillemet

Sounds good. Will do.

May 23 '24 15:05 haifengl

Could we limit this to the Gloo backend for pytorch and NCCL for pytorch-gpu? That is, no support for MPI and UCC?

May 25 '24 07:05 HGuillemet
