torch-ccl

Issue for the new NGC images

Open PhdShi opened this issue 2 years ago • 4 comments

Hi! I was recently looking at the NGC images site and noticed the following:

Starting with the 22.11 PyTorch NGC container, miniforge is removed and all Python packages are installed 
in the default Python environment. In case you depend on Conda-specific packages, which might not be 
available on PyPI, we recommend building these packages from source. A workaround is to manually install 
a Conda package manager, and add the conda path to your PYTHONPATH for example, using export 
PYTHONPATH="/opt/conda/lib/python3.8/site-packages" if your Conda package manager was installed in 
/opt/conda.

It seems that NGC images no longer provide a Conda environment, and the PyTorch files have moved into the default Python environment. When I docker run one of the new images, such as nvcr.io/nvidia/pytorch:22.11-py3, I find no c10d header files under /usr/local/lib/python3.8/dist-packages/torch/include. However, ProcessCCL.hpp needs the header file <torch/csrc/distributed/c10d/Utils.hpp>. How can we solve this so that torch-ccl can be used in the latest NGC images?
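To confirm which headers an installed PyTorch ships, a check along the following lines may help. It is only a sketch: the helper names `find_headers` and `torch_include_dir` are made up for illustration, and the second candidate path (`c10d/Utils.hpp`) is an assumed older layout, not something stated in this thread.

```python
import importlib.util
from pathlib import Path

# Header named in the issue, plus an assumed older location (illustrative only).
CANDIDATE_HEADERS = [
    "torch/csrc/distributed/c10d/Utils.hpp",
    "c10d/Utils.hpp",  # assumption: layout before the c10d path change
]


def find_headers(include_root, candidates=CANDIDATE_HEADERS):
    """Map each relative header path to True/False depending on whether it exists."""
    root = Path(include_root)
    return {rel: (root / rel).is_file() for rel in candidates}


def torch_include_dir():
    """Return <torch package>/include without importing torch, or None if torch is absent."""
    spec = importlib.util.find_spec("torch")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0]) / "include"


if __name__ == "__main__":
    include_dir = torch_include_dir()
    if include_dir is None:
        print("torch is not installed in this environment")
    else:
        for rel, present in find_headers(include_dir).items():
            status = "found  " if present else "missing"
            print(f"{status} {include_dir / rel}")
```

Running this inside the container should show whether the include directory mentioned above actually contains the header that torch-ccl needs.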

PhdShi, Jan 05 '23 03:01

Which PyTorch and torch-ccl versions are you using?

liangan1, Jan 05 '23 05:01

NGC image: nvcr.io/nvidia/pytorch:22.11-py3
PyTorch version: 1.13.0a0+936e930
torch-ccl: 1.13

PhdShi, Jan 05 '23 08:01

It seems that your PyTorch codebase is older than the 1.13.0 tag, and PyTorch changed the c10d distributed path in https://github.com/pytorch/pytorch/pull/85780, so you have two options to fix this issue:

  1. Use the 1.13.0 release code.
  2. Try the torch-ccl-1.12.100 release.
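The version reported above, 1.13.0a0+936e930, carries an alpha suffix (a0), which is why it counts as older than the 1.13.0 tag. A rough way to test for that, using only the standard library, is sketched below. The function names are hypothetical, and the parser handles only simple X.Y.Z strings with an optional pre-release and local suffix, not every valid version string.

```python
import re

# X.Y.Z, optional pre-release (a0, b1, rc2, .dev3), optional local part (+abc123)
_VERSION_RE = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)((?:a|b|rc)\d+|\.dev\d+)?(?:\+(.+))?$"
)


def parse_version(version):
    """Split a version string into ((major, minor, patch), pre_release, local)."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch, pre, local = match.groups()
    return (int(major), int(minor), int(patch)), pre, local


def predates_release(version, release):
    """Return True if `version` sorts before the final `release`.

    Simplification: two pre-releases of the same X.Y.Z are not ordered
    against each other; that case returns False.
    """
    numbers, pre, _ = parse_version(version)
    target, target_pre, _ = parse_version(release)
    if numbers != target:
        return numbers < target
    return pre is not None and target_pre is None


if __name__ == "__main__":
    print(predates_release("1.13.0a0+936e930", "1.13.0"))
```

For real projects, a maintained version parser is a better choice than this regex. The sketch is only meant to show why the build in the NGC image is treated as pre-1.13.0.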

liangan1, Jan 09 '23 00:01

Thanks for your reply! The first option solved my problem. The second option didn't work, but that is not the fault of torch-ccl or PyTorch. What I mean is that the precompiled PyTorch provided by the NGC image no longer contains the C++ header files, so I had to recompile PyTorch before torch-ccl would compile correctly.

PhdShi, Jan 09 '23 01:01