torch-ccl
Issue for the new NGC images
Hi! I was recently looking at the NGC image pages and noticed the following note:
Starting with the 22.11 PyTorch NGC container, miniforge is removed and all Python packages are installed
in the default Python environment. In case you depend on Conda-specific packages, which might not be
available on PyPI, we recommend building these packages from source. A workaround is to manually install
a Conda package manager, and add the conda path to your PYTHONPATH for example, using export
PYTHONPATH="/opt/conda/lib/python3.8/site-packages" if your Conda package manager was installed in
/opt/conda.
It seems that NGC images will no longer provide a Conda environment, and the PyTorch-related files have been moved into the default Python environment. When I run one of the new images, such as nvcr.io/nvidia/pytorch:22.11-py3, I find no c10d-related header files under /usr/local/lib/python3.8/dist-packages/torch/include. However, ProcessCCL.hpp includes the header <torch/csrc/distributed/c10d/Utils.hpp>. How can we solve this so that torch-ccl can be used in the latest NGC images?
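For anyone who wants to reproduce this inside the container, a check along these lines might help confirm whether the header is visible to PyTorch's extension build machinery. It is only a sketch: `find_header` and `torch_include_dirs` are hypothetical helper names introduced here, and it assumes that `torch.utils.cpp_extension.include_paths()` reports the same include directories a torch-ccl build would search.

```python
from pathlib import Path

# Header reported as missing in this issue (path taken from the text above).
C10D_HEADER = "torch/csrc/distributed/c10d/Utils.hpp"


def find_header(include_dirs, header=C10D_HEADER):
    """Return the first existing full path to `header`, or None if absent."""
    for inc in include_dirs:
        candidate = Path(inc) / header
        if candidate.is_file():
            return candidate
    return None


def torch_include_dirs():
    """Include directories reported by the installed PyTorch ([] if torch is absent)."""
    try:
        from torch.utils.cpp_extension import include_paths
    except ImportError:
        return []
    return include_paths()


if __name__ == "__main__":
    found = find_header(torch_include_dirs())
    if found:
        print(found)
    else:
        print(f"{C10D_HEADER} not found in torch include paths")
```

If the script reports the header as not found in nvcr.io/nvidia/pytorch:22.11-py3, that would match the observation above and point to the headers being left out of the installed package rather than to a path problem in the torch-ccl build.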