Jithun Nair

Results 170 comments of Jithun Nair

@jeffdaily We have some internal documentation that highlights some of the differences in enabling PyTorch extensions for ROCm. Shall I put that together into something we can publish on the...

@pruthvistony I think we discussed this before, but just to make sure: could the build_amd.py be part of hipify-torch so that it doesn't have to be added to the hipifying...

From https://github.com/microsoft/DeepSpeed/actions/runs/8474231174/job/23220238944#step:9:16730: `85 failed, 820 passed, 178 skipped, 88 warnings, 20 errors in 14061.19s (3:54:21)` @rraminen Let's post a breakdown of the 85 failures here for better assessment of next...

> List of errors are here: (most are NCCL and probably should not be running)
>
> ```
> FAILED unit/runtime/pipe/test_topology.py::TestDistributedTopology::test_stage_to_global - torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error, NCCL...
> ```
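
One way to produce the kind of failure breakdown requested above is to group the `FAILED` lines of the pytest short summary by exception type. The following is a minimal sketch, not something taken from the thread: `group_failures` and `FAILED_LINE` are hypothetical names, and the pattern assumes each line has the shape `FAILED <test id> - <exception type>: <message>`, as in the quoted log.

```python
import re
from collections import Counter

# Assumed line shape (from the quoted log):
# "FAILED path/to/test.py::TestClass::test_name - some.module.Error: message"
FAILED_LINE = re.compile(r"^FAILED\s+(?P<test>\S+)\s+-\s+(?P<error>[\w.]+):")


def group_failures(log_text):
    """Count FAILED entries by exception type (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        # Drop any leading "> " quote markers before matching.
        match = FAILED_LINE.match(line.lstrip("> ").strip())
        if match:
            counts[match.group("error")] += 1
    return counts
```

Calling `group_failures(log_text).most_common()` on the saved job log would list exception types from most to least frequent, which makes it easy to see how many of the failures are NCCL-related.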

@rraminen Formatting checks failed with a trailing-whitespace error: https://github.com/microsoft/DeepSpeed/actions/runs/9115455800/job/25064328013?pr=5401#step:5:60 This should be straightforward to fix; can you please check?

@Hobbes-Le-Chat I don't think you actually captured the error snippet; all we see are warnings and then:

```
17 warnings and 2 errors generated when compiling for gfx1030.
error: command...
```

@Hobbes-Le-Chat Thanks, the log file helped! Btw, I think you should update the title of this issue to "Build issues on ROCm with random_ltd extension" or something, since I don't...

Yes, these are not yet supported in ROCm; we are working on adding support. Additionally, we are also considering adding a way to disable unsupported extensions by default,...

Commands I used to reproduce the linker error:

hipcc -o super_simple_reducemax_kernel.o -c super_simple_reducemax_kernel.cu

hipcc super_simple_reducemax_kernel.o

Linker error:

```
rocm-user@a69b1b7130d8:~/pytorch__hc2_v4__clean/TEMP$ hipcc super_simple_reducemax_kernel.o
LLVM ERROR: Cannot select: 0x2f61a40: v2i16 = SMAX3 0x2f619d8,...
```