awsome-distributed-training
nccl-tests container: fix cuda driver mismatch
Issue #, if available: nccl-tests with the container image fails with 'system has unsupported display driver / cuda driver combination'.
Description of changes:
- Update CUDA compat to fix the error:
      7: ip-10-1-113-84: Test CUDA failure common.cu:894 'system has unsupported display driver / cuda driver combination'
      7: .. ip-10-1-113-84 pid 738939: Test failure common.cu:844
- Fix docker build variable expansion failure
- Ban the veth_def_agent interface from NCCL consideration (SMHP)
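
For reviewers unfamiliar with how an interface is kept out of NCCL's selection, the following is a minimal sketch rather than the exact change in this PR. It assumes the exclusion is expressed through the `NCCL_SOCKET_IFNAME` environment variable, where a leading `^` tells NCCL to skip interfaces whose names start with the listed prefixes. Where the variable is actually set (Dockerfile `ENV`, sbatch script, or elsewhere) is not stated in this description.

```shell
# Sketch only: exclude the veth_def_agent interface from NCCL's socket
# interface selection. The leading "^" means "do not use interfaces whose
# names start with the following prefixes".
# Assumption: this PR uses NCCL_SOCKET_IFNAME; the description does not say.
export NCCL_SOCKET_IFNAME="^veth_def_agent"

# Print the value so it shows up in the job log alongside NCCL output.
echo "NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME}"
```

Running the tests with `NCCL_DEBUG=INFO` should make NCCL log which interface it picked, which is one way to confirm that the excluded interface is no longer considered.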
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.