Questions about error reporting
What's the issue, what's expected?: Hello author, I ran into an nvlink reference error when setting up the environment myself. The error occurred when I built MSCCL with make -j src.build NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90". Another question: can this project be run on the H800? I did not use --privileged --net=host --ipc=host when creating the container. Will that also have an impact?
How to reproduce it?: My system environment is Ubuntu 22.04, Python 3.11, CUDA 11.8, and PyTorch 2.0.1. The hardware is a node with 8 H800 GPUs.
Log message or snapshot?:
Additional information:
Hi, thanks for your interest in our work. We currently don't have an H800 node on hand, so we can't verify this. Have you tried the latest MS-AMP Docker image? Omitting --privileged --net=host --ipc=host should not be a problem if you only run on a single node.
sudo docker run -it -d --name=msampcu121 --privileged --net=host --ipc=host --gpus=all -v /:/hostroot ghcr.io/azure/msamp:main-cuda12.1 bash
sudo docker exec -it msampcu121 bash
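Once inside the container, it may help to confirm that the NVCC_GENCODE value passed to make matches what the GPU reports before rebuilding MSCCL. The following is only a sketch: gencode_for_cap is a hypothetical helper, not part of MS-AMP or MSCCL, and the compute_cap query field is assumed to be supported by the installed driver's nvidia-smi.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MS-AMP or MSCCL): turn a compute
# capability such as "9.0" into the NVCC_GENCODE value used in the
# make command from the question.
gencode_for_cap() {
    local cap="${1//./}"   # drop the dot: "9.0" -> "90"
    printf -- '-gencode=arch=compute_%s,code=sm_%s\n' "$cap" "$cap"
}

# Query the first GPU only when nvidia-smi is available. Assumption:
# the driver's nvidia-smi supports the compute_cap query field.
if command -v nvidia-smi >/dev/null 2>&1; then
    cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
    if [ -n "$cap" ]; then
        echo "NVCC_GENCODE=\"$(gencode_for_cap "$cap")\""
    fi
fi
```

If the printed value differs from the one used in the build, that mismatch would be worth ruling out first. Running nvcc --version in the same shell also shows which CUDA toolkit the build is actually picking up.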
Closing this issue since there has been no activity for a long time.