Jithun Nair
@jeffdaily We have some internal documentation that highlights some of the differences in enabling PyTorch extensions for ROCm. Shall I put that together into something we can publish on the...
@pruthvistony I think we discussed this before, but just to make sure: could the build_amd.py be part of hipify-torch so that it doesn't have to be added to the hipifying...
From https://github.com/microsoft/DeepSpeed/actions/runs/8474231174/job/23220238944#step:9:16730: `85 failed, 820 passed, 178 skipped, 88 warnings, 20 errors in 14061.19s (3:54:21)` @rraminen Let's post a breakdown of the 85 failures here for better assessment of next...
> List of errors are here: (most are NCCL and probably should not be running)
>
> ```
> FAILED unit/runtime/pipe/test_topology.py::TestDistributedTopology::test_stage_to_global - torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error, NCCL...
> ```
@rraminen Formatting checks failed with a trailing-whitespace error: https://github.com/microsoft/DeepSpeed/actions/runs/9115455800/job/25064328013?pr=5401#step:5:60 Should be a straightforward fix, can you please check?
@Hobbes-Le-Chat I don't think you actually captured the error snippet. All we see are warnings and then:
```
17 warnings and 2 errors generated when compiling for gfx1030. error: command...
```
@Hobbes-Le-Chat Thanks, the log file helped! Btw, I think you should update the title of this issue to "Build issues on ROCm with random_ltd extension" or something, since I don't...
Yes, these are not yet supported in ROCm; we are working on adding support. We are also considering adding a way to disable unsupported extensions by default,...
Commands I used to reproduce the linker error:
hipcc -o super_simple_reducemax_kernel.o -c super_simple_reducemax_kernel.cu
hipcc super_simple_reducemax_kernel.o

Linker error:
```
rocm-user@a69b1b7130d8:~/pytorch__hc2_v4__clean/TEMP$ hipcc super_simple_reducemax_kernel.o
LLVM ERROR: Cannot select: 0x2f61a40: v2i16 = SMAX3 0x2f619d8,...
```