
Issue for the new NGC images

Open PhdShi opened this issue 2 years ago • 4 comments

Hi! I was recently looking at the NGC images site and noticed the following note:

Starting with the 22.11 PyTorch NGC container, miniforge is removed and all Python packages are installed 
in the default Python environment. In case you depend on Conda-specific packages, which might not be 
available on PyPI, we recommend building these packages from source. A workaround is to manually install 
a Conda package manager, and add the conda path to your PYTHONPATH for example, using export 
PYTHONPATH="/opt/conda/lib/python3.8/site-packages" if your Conda package manager was installed in 
/opt/conda.

It seems that NGC images will no longer provide the conda environment, and the PyTorch-related files have moved to the default Python environment. When I docker run one of the new images, such as nvcr.io/nvidia/pytorch:22.11-py3, I find no c10d-related header files under /usr/local/lib/python3.8/dist-packages/torch/include. But ProcessCCL.hpp needs the header <torch/csrc/distributed/c10d/Utils.hpp>. How can we solve this so that torch-ccl can be used in the latest NGC images?
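To reproduce the check inside a container, something like the following sketch could help. It looks up where the torch package is installed and tests whether the header named above exists under its include directory. The helper names `torch_include_dir` and `has_header` are made up for this example and are not part of torch or torch-ccl, and the sketch assumes the headers, when shipped, sit under `<torch package>/include`, as in the path quoted above.

```python
import importlib.util
from pathlib import Path

# Header named in the issue, relative to torch's include directory.
C10D_HEADER = "torch/csrc/distributed/c10d/Utils.hpp"


def torch_include_dir():
    """Return <torch package>/include, or None if torch is not installed.

    Hypothetical helper: locates the package without importing it.
    """
    spec = importlib.util.find_spec("torch")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0]) / "include"


def has_header(include_dir, header=C10D_HEADER):
    """Return True if `header` exists as a file under `include_dir`."""
    return (Path(include_dir) / header).is_file()


if __name__ == "__main__":
    inc = torch_include_dir()
    if inc is None:
        print("torch is not installed in this environment")
    else:
        state = "found" if has_header(inc) else "missing"
        print(f"{inc}: {state} {C10D_HEADER}")
```

Running this in both an older conda-based image and the 22.11 image should show whether the header is really absent or just lives in a different location.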

PhdShi, Jan 05 '23 03:01