
Questions about error reporting

Open. Mrzhang-dada opened this issue 2 years ago • 1 comment

What's the issue, what's expected?: Hello author, I hit an nvlink reference error while setting up the environment myself. The error occurred when I built MSCCL with make -j src.build NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90". I have two more questions. First, can this project run on the H800? Second, I did not pass --privileged --net=host --ipc=host when creating the container; will that also have an impact?

How to reproduce it?: My system environment is Ubuntu 22.04, Python 3.11, CUDA 11.8, and torch 2.0.1. The hardware is a node with 8 H800 GPUs.

Log message or snapshot?:

Additional information:

Mrzhang-dada, Nov 10 '23 01:11

Hi. Thanks for your interest in our work. We currently don't have an H800 node on hand, so we can't verify this. Have you tried the latest MS-AMP Docker image? Omitting --privileged --net=host --ipc=host should not be a problem if you only run on a single node.

sudo docker run -it -d --name=msampcu121 --privileged --net=host --ipc=host --gpus=all -v /:/hostroot ghcr.io/azure/msamp:main-cuda12.1 bash
sudo docker exec -it msampcu121 bash
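
Before rebuilding MSCCL inside the container, it may help to confirm that the NVCC_GENCODE value from the question matches what the GPU reports. The sketch below is only an illustration: gencode_for_cap is a hypothetical helper name, not part of MS-AMP or MSCCL, and the commented usage assumes a driver recent enough that nvidia-smi supports the compute_cap query field.

```shell
# Hypothetical helper (not part of MS-AMP or MSCCL): turn a compute
# capability such as "9.0" into an NVCC_GENCODE value such as
# "-gencode=arch=compute_90,code=sm_90".
gencode_for_cap() {
    # Drop the dot so "9.0" becomes "90".
    cap_digits=$(printf '%s' "$1" | tr -d '.')
    # "--" stops printf from treating the leading "-" as an option.
    printf -- '-gencode=arch=compute_%s,code=sm_%s\n' "$cap_digits" "$cap_digits"
}

# Possible usage on a machine with a GPU (assumes nvidia-smi supports
# the compute_cap query field):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   make -j src.build NVCC_GENCODE="$(gencode_for_cap "$cap")"
```

If the value derived this way differs from the one passed to make, the build is targeting a different architecture than the installed GPUs. This does not by itself explain the nvlink error, which would need the full log to diagnose.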

tocean, Nov 13 '23 02:11

Closing this issue since there has been no activity for a long time.

tocean, Aug 02 '24 10:08