Logan Adams

Results: 294 comments by Logan Adams

FYI @rraminen and @jithunnair-amd - the AMD tests should be running again (linked above and in CI)

> From https://github.com/microsoft/DeepSpeed/actions/runs/8474231174/job/23220238944#step:9:16730: `85 failed, 820 passed, 178 skipped, 88 warnings, 20 errors in 14061.19s (3:54:21)`
>
> @rraminen Let's post a breakup of the 85 failures here for better...

The list of errors is here (most are NCCL and probably should not be running):

```
FAILED unit/runtime/pipe/test_topology.py::TestDistributedTopology::test_stage_to_global - torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error, NCCL version 2.17.1
FAILED unit/runtime/half_precision/test_fp16.py::TestZeroEmptyPartition::test[True-1]...
```

> > List of errors are here: (most are NCCL and probably should not be running)
> >
> > ```
> > FAILED unit/runtime/pipe/test_topology.py::TestDistributedTopology::test_stage_to_global - torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal...
> > ```

Abandoning in favor of !5491, which updates to ROCm 6.

Hi @delock - FYI could you resolve the merge conflicts on this PR so it can be reviewed/tests run?

> > Hi @delock - FYI could you resolve the merge conflicts on this PR so it can be reviewed/tests run?
>
> Hi @loadams. The conflicts have been resolved....

> @loadams @mrwyattii Hi, could you help to trigger a CI for this PR? thanks!

Done @YizhouZ - could you run the pre-commit formatter to pass the formatting check? Thanks

> @tjruwase @loadams It seems _nv-torch-latest-v100 / unit-tests_ tests failed somehow in today's merging commit. I did not see any error msg in log, is it something related to CI...