awsome-distributed-training
EKS NCCL Tests limited to 8GB
The NCCL tests for Kubernetes set a memory limit of 8GB on the container, which causes an OOM issue when the tests run.
https://github.com/aws-samples/awsome-distributed-training/blob/a99d6cd0f48abeecfa7d5a7710af4eb0a7079752/micro-benchmarks/nccl-tests/kubernetes/nccl-tests.yaml#L91
This results in an issue that looks like:
[1,2]<stdout>:test-nccl-worker-0:199:263 [2] process_err_completion:1562 NCCL WARN NET/OFI Request 0x7f08c0001d30 completed with error. RC: 5. Error: Invalid receiver queue pair number (QPN) My EFA addr: fi_addr_efa://[fe80::30:43ff:fe60:456b]:0:657145598 My host id: i-00ee91ea93e985ee0 Peer EFA addr: fi_addr_efa://[fe80::ed:61ff:fe1d:a3b3]:0:229342137 Peer host id: i-08d8a345a75ab52a0. Completed length: 0, Request: { dev: 2, size: 0, state: CREATED, type: SEND_CTRL }
Although the message points at EFA networking, this is really an OOM issue. We need to raise the memory limit to something more reasonable.
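
For illustration, here is a rough sketch of the kind of change being proposed. It assumes the worker container declares its limits under the standard Kubernetes `resources` field. The container name and the `32Gi` value are placeholders I made up, not values taken from the linked manifest. The right number depends on the instance type and what else runs on the node.

```yaml
# Hypothetical excerpt of a worker container spec.
# The container name and memory values are illustrative assumptions.
containers:
  - name: nccl-tests-worker        # assumed name, not from the manifest
    resources:
      requests:
        memory: 32Gi               # placeholder value
      limits:
        memory: 32Gi               # raised from the current 8GB limit
```

To confirm that a worker was actually killed for exceeding its limit, `kubectl describe pod` on the affected pod should show `OOMKilled` as the reason for the container's last terminated state.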