Nick Kuo
Hi @DNYZH, I need some more context on this. Could you share a partial, reproducible CUDA snippet that fails to convert? Let's align our configurations first.
Sorry for the delayed response, I was caught up in an urgent internal issue. With your repro steps, I tested two scenarios: ## Common code files: define.cu.hip (Hipify) ``` #include #include...
Hi @DNYZH, I'm closing this issue for now. Feel free to reopen or create a new issue should you encounter any further issues. Thanks!
I posted my current findings in https://github.com/ROCm/rccl/issues/2022. It looks like an RCCL problem for now, so tracking continues there.
Update on the current debug status: the offending call in vLLM is actually not `NCCLInitComm` but a `hipEventQuery` issued during graph capture. This explains why the reproducer cannot duplicate it but vLLM still...
Hi @zejunchen-zejun, could you please help test the following script on an NVIDIA system? This script is able to repro the problem on the AMD platform; I'd just like to make sure...
The Watchdog behaves exactly the same on NV and AMD, so if the reproducer does not fail on NV, it is likely a HIP runtime problem we need...
Hi @thananon, I inspected this issue further, and this line seems to be what breaks stream capture: https://github.com/ROCm/rccl/blob/62ab7a22d741ab4f214b6b185b77d030ba7bb85b/src/init.cc#L2442 Is this `hipFree` call expected here? Call stack from torch: Disassembling `ncclCommInitRankDev`...
Hi @zejunchen-zejun, thanks for trying this out on an NV system. I dug deeper into PyTorch, and my initial guess is that this is expected. In the repro script you provided, the NCCL...
Hi @da-phil, yes, I agree with @ianbmacdonald that you should let TTM manage the memory instead of configuring a carveout in the BIOS. It might sound counter-intuitive, but if you carve out a...
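
For anyone sizing the TTM-managed pool by hand, here is a rough sketch of the arithmetic. It assumes the limit is set through the `ttm.pages_limit` kernel module parameter and that the value is counted in 4 KiB pages; neither detail comes from this thread, so please check both against your kernel's documentation. The helper name `gib_to_ttm_pages` and the 96 GiB figure are purely illustrative.

```python
PAGE_SIZE_BYTES = 4096  # assumption: the limit is counted in 4 KiB pages


def gib_to_ttm_pages(gib: int, page_size: int = PAGE_SIZE_BYTES) -> int:
    """Convert a memory budget in GiB into a page count.

    Hypothetical helper for illustration only; it is not part of any
    ROCm or kernel tooling.
    """
    if gib <= 0:
        raise ValueError("memory budget must be positive")
    return (gib * 1024**3) // page_size


if __name__ == "__main__":
    budget_gib = 96  # illustrative value, pick what fits your system
    pages = gib_to_ttm_pages(budget_gib)
    # Assumed parameter name; append the printed string to the kernel command line.
    print(f"ttm.pages_limit={pages}")
```

The point of the conversion is that the limit is expressed in pages rather than bytes, so a value copied from a BIOS carveout setting in GiB cannot be reused directly.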