
nccl-tests container: fix cuda driver mismatch

Open verdimrc opened this issue 9 months ago • 7 comments

Issue #, if available: nccl-tests run from the container image fails with the error 'system has unsupported display driver / cuda driver combination'.

Description of changes:

  • update cuda compat to fix error:

     7: ip-10-1-113-84: Test CUDA failure common.cu:894 'system has unsupported display driver / cuda driver combination'
     7:  .. ip-10-1-113-84 pid 738939: Test failure common.cu:844
    
  • fix docker build variable expansion failure

  • exclude the veth_def_agent interface from NCCL's network interface selection (SMHP, i.e. SageMaker HyperPod)
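
As an illustration of the last item, here is a minimal sketch of how an interface can be kept out of NCCL's socket interface selection. It assumes the exclusion is done through the NCCL_SOCKET_IFNAME environment variable, where a leading `^` tells NCCL to skip interfaces whose names start with the listed prefixes. The helper name `exclude_interfaces` is hypothetical, and the actual change in this PR is not shown here and may differ.

```python
import os


def exclude_interfaces(names, env=None):
    """Return a copy of `env` with NCCL_SOCKET_IFNAME set to exclude `names`.

    Hypothetical helper for illustration only. A value starting with '^'
    asks NCCL to ignore interfaces whose names begin with any listed prefix.
    """
    if not names:
        raise ValueError("at least one interface name is required")
    # Copy so the caller's mapping (or os.environ) is not modified in place.
    new_env = dict(os.environ if env is None else env)
    new_env["NCCL_SOCKET_IFNAME"] = "^" + ",".join(names)
    return new_env


if __name__ == "__main__":
    env = exclude_interfaces(["veth_def_agent"], env={})
    print(env["NCCL_SOCKET_IFNAME"])
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching the test binary, or the same value can be set with an `ENV` line in the container image.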

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

verdimrc, May 07 '24 07:05