torch-ccl
Issue for the new NGC images
Hi! I was recently reading the NGC image release notes and noticed the following:
Starting with the 22.11 PyTorch NGC container, miniforge is removed and all Python packages are installed
in the default Python environment. In case you depend on Conda-specific packages, which might not be
available on PyPI, we recommend building these packages from source. A workaround is to manually install
a Conda package manager, and add the conda path to your PYTHONPATH for example, using export
PYTHONPATH="/opt/conda/lib/python3.8/site-packages" if your Conda package manager was installed in
/opt/conda.
It seems that NGC images will no longer provide a Conda environment, and the PyTorch-related files have moved into the default Python environment. When I ran one of the new images, such as nvcr.io/nvidia/pytorch:22.11-py3, with docker run, I found no c10d header files under /usr/local/lib/python3.8/dist-packages/torch/include. But ProcessCCL.hpp needs the header <torch/csrc/distributed/c10d/Utils.hpp>. How can we solve this so that torch-ccl can be used with the latest NGC image?
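For anyone who wants to reproduce the check described above, here is a minimal sketch that looks for the c10d header inside the installed torch package. The helper names `find_header` and `torch_include_dirs` are hypothetical, and the sketch assumes the headers, when present, live in the `include` directory next to `torch/__init__.py`, as in the path quoted above.

```python
from pathlib import Path

# The c10d header that the report above says torch-ccl needs.
C10D_HEADER = "torch/csrc/distributed/c10d/Utils.hpp"


def find_header(include_dirs, rel_path=C10D_HEADER):
    """Return the first existing path to rel_path under include_dirs, or None."""
    for directory in include_dirs:
        candidate = Path(directory) / rel_path
        if candidate.is_file():
            return candidate
    return None


def torch_include_dirs():
    """Include directory of the installed torch package (empty list if torch is missing)."""
    try:
        import torch
    except ImportError:
        return []
    # Assumption: headers ship in <site-packages>/torch/include.
    return [Path(torch.__file__).parent / "include"]


if __name__ == "__main__":
    found = find_header(torch_include_dirs())
    print(found if found else f"missing: {C10D_HEADER}")
```

If this reports the header as missing inside the container, torch-ccl cannot be compiled against that PyTorch install as-is.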
Which PyTorch and torch-ccl versions are you using?
- NGC image: nvcr.io/nvidia/pytorch:22.11-py3
- PyTorch version: 1.13.0a0+936e930
- torch-ccl: 1.13
It seems that your PyTorch codebase is older than the 1.13.0 tag, and PyTorch changed the c10d distributed header path in https://github.com/pytorch/pytorch/pull/85780, so you have two options to fix this issue:
- use the PyTorch 1.13.0 release code
- try the torch-ccl-1.12.100 release.
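The version string reported above, 1.13.0a0+936e930, carries an alpha tag (`a0`), which in PEP 440 ordering sorts before the final 1.13.0 release. As a rough sketch, the check below flags such pre-release builds using only the standard library. The function name `is_prerelease_build` is hypothetical, and the regex only covers the simple `a`/`b`/`rc` forms, not the full PEP 440 grammar.

```python
import re


def is_prerelease_build(version):
    """True if a PEP 440-style version string carries an a/b/rc pre-release tag."""
    public = version.split("+", 1)[0]  # drop a local build id such as +936e930
    return re.fullmatch(r"\d+(\.\d+)*(a|b|rc)\d+", public) is not None


print(is_prerelease_build("1.13.0a0+936e930"))  # True
print(is_prerelease_build("1.13.0"))            # False
```

In a live environment the same function could be applied to `torch.__version__` before choosing which torch-ccl release to build.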
Thanks for your reply! The first option solved my problem. The second option didn't work, but that is not the fault of torch-ccl or PyTorch. My point is that the prebuilt PyTorch shipped in the NGC image no longer contains the C++ header files, so I had to recompile PyTorch for torch-ccl to compile correctly.