javacpp-presets
[pytorch] How to do distributed training of a model with JavaCPP PyTorch
Hi, for big models we need to train across many distributed machines. In the Python version we can use torch.distributed to set up training across machines, but in JavaCPP PyTorch we cannot find any distributed methods. How can we do this with JavaCPP?
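For example, in Python we can do something like this (a minimal sketch; it assumes MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are set by a launcher such as torchrun):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Each machine/process runs this; rank and world size come from the launcher.
dist.init_process_group(backend="gloo")  # "nccl" for multi-GPU training

model = torch.nn.Linear(10, 1)
ddp_model = DDP(model)  # gradients are all-reduced across processes

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
loss = ddp_model(torch.randn(4, 10)).sum()
loss.backward()   # gradient synchronization happens here
optimizer.step()

dist.destroy_process_group()
```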
There are also good distributed training tools now, like BMTrain (https://github.com/OpenBMB/BMTrain); if we could compile it for Java, maybe we could try that.
We would also like to add Spark GPU training with JavaCPP Torch.
Any progress on this? Any plan to support torch.distributed? Thanks!
That's ongoing, but I know little about this API. I will post a PR when I get a first version compiling, but I'll need you to test it and see whether everything that is needed has been mapped.
Sounds good. Will do.
Could we limit this to the Gloo backend for pytorch and to NCCL for pytorch-gpu? That is, no support for MPI and UCC?
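From the user's point of view, the backend only appears as the string passed to init_process_group, so limiting the presets to Gloo (CPU) and NCCL (GPU) should not change typical code. A minimal Python sketch for illustration, run under a launcher such as torchrun:

```python
import torch
import torch.distributed as dist

# Pick the backend the same way the proposed preset split would:
# NCCL for GPU builds, Gloo otherwise. MPI/UCC are simply never requested.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)

t = torch.ones(1)
if backend == "nccl":
    # NCCL operates on CUDA tensors; pin each rank to one device.
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    t = t.cuda()
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # t now equals world_size on every rank
print(f"rank {dist.get_rank()}: {t.item()}")

dist.destroy_process_group()
```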