Add support for distributed optimizer
Enable no_copy argument in c10d files for distributed optimizers
Please address CI failures. Source code formatting changes were flagged by quick-check linting. Also, ROCm build failed.
16:16:03 /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/init.cpp:552:7: error: expected primary-expression before ‘.’ token
16:16:03 552 | .def_readwrite("noCopy", &::c10d::AllgatherOptions::noCopy);
16:16:03 | ^
16:16:03 /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/init.cpp:568:7: error: expected primary-expression before ‘.’ token
16:16:03 568 | .def_readwrite("noCopy", &::c10d::ReduceScatterOptions::noCopy);
16:16:03 | ^
@jeffdaily The above errors are fixed and committed. Let me know whether anything else needs to be done for the ROCm build failure, or whether it was caused only by this error.
@athitten there were still some whitespace errors that I have fixed. Some mypy checks were failing that I needed to familiarize myself with and then fix. Otherwise, just waiting for one last ROCm CI check to finish and then we can merge.
@jeffdaily I just saw that you have addressed these; I was about to push commits with the changes. Thank you very much for taking care of this.
@jeffdaily @athitten just came across this... looks like we forgot to merge this. Will this PR cause our fork to diverge from upstream in this regard, and is that what we want here?
@jithunnair-amd yes, it does diverge, and we would like to add it to support the no_copy option in torch distributed. Here is the JIRA with more details: [https://ontrack-internal.amd.com/browse/SWDEV-306609] Let me know if you need any more information, thanks!