pytorch icon indicating copy to clipboard operation
pytorch copied to clipboard

Add support for distributed optimizer

Open athitten opened this issue 4 years ago • 5 comments

Enable no_copy argument in c10d files for distributed optimizers

athitten avatar Nov 01 '21 15:11 athitten

Please address CI failures. Source code formatting changes were flagged by quick-check linting. Also, ROCm build failed.

16:16:03 /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/init.cpp:552:7: error: expected primary-expression before ‘.’ token
16:16:03   552 |       .def_readwrite("noCopy", &::c10d::AllgatherOptions::noCopy);
16:16:03       |       ^
16:16:03 /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/init.cpp:568:7: error: expected primary-expression before ‘.’ token
16:16:03   568 |       .def_readwrite("noCopy", &::c10d::ReduceScatterOptions::noCopy);
16:16:03       |       ^

Please address CI failures. Source code formatting changes were flagged by quick-check linting. Also, ROCm build failed.

16:16:03 /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/init.cpp:552:7: error: expected primary-expression before ‘.’ token
16:16:03   552 |       .def_readwrite("noCopy", &::c10d::AllgatherOptions::noCopy);
16:16:03       |       ^
16:16:03 /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/init.cpp:568:7: error: expected primary-expression before ‘.’ token
16:16:03   568 |       .def_readwrite("noCopy", &::c10d::ReduceScatterOptions::noCopy);
16:16:03       |       ^

@jeffdaily The above errors are fixed and committed. Let me know if there's anything that needs to be done for the "rocm build fail" or it was just wrt this error

athitten avatar Nov 02 '21 19:11 athitten

@athitten there were still some whitespace errors that I have fixed. Some mypy checks were failing that I needed to familiarize myself with and then fix. Otherwise, just waiting for one last ROCm CI check to finish and then we can merge.

jeffdaily avatar Nov 04 '21 21:11 jeffdaily

@athitten there were still some whitespace errors that I have fixed. Some mypy checks were failing that I needed to familiarize myself with and then fix. Otherwise, just waiting for one last ROCm CI check to finish and then we can merge.

@jeffdaily just saw that you have addressed these, was about to push in commits with the changes. Thank you very much for taking care of this.

athitten avatar Nov 05 '21 14:11 athitten

@jeffdaily @athitten just came across this...looks like we forgot to merge this. Will this PR cause our fork to diverge from upstream in this regard ,and is that what we want here?

jithunnair-amd avatar Jan 10 '22 05:01 jithunnair-amd

@jeffdaily @athitten just came across this...looks like we forgot to merge this. Will this PR cause our fork to diverge from upstream in this regard ,and is that what we want here?

@jithunnair-amd yes it does diverge and we would like to add it to support no_copy option in torch distributed. Here is the JIRA with more details on this: [https://ontrack-internal.amd.com/browse/SWDEV-306609] Let me know if you need any more information, thanks!

athitten avatar Jan 18 '22 17:01 athitten