Nick Kuo comments

Results: 15 comments by Nick Kuo

lld: error: undefined protected symbol: cCriticality >>> referenced by /tmp/FinalizeGPU-e78412/FinalizeGPU-gfx928.o:(FinalizeKeff_GPUg())

Hi @DNYZH, I need some further context on this. Could you share a partial, reproducible CUDA sample that fails to convert? Let's align our configurations first.

lld: error: undefined protected symbol: cCriticality >>> referenced by /tmp/FinalizeGPU-e78412/FinalizeGPU-gfx928.o:(FinalizeKeff_GPUg())

Sorry for the delayed response; I was caught up in an urgent internal issue. With your repro steps, I tested two scenarios: ## Common code files: define.cu.hip (Hipify) ``` #include #include...

lld: error: undefined protected symbol: cCriticality >>> referenced by /tmp/FinalizeGPU-e78412/FinalizeGPU-gfx928.o:(FinalizeKeff_GPUg())

Hi @DNYZH, I'm closing this issue for now. Feel free to reopen or create a new issue should you encounter any further issues. Thanks!

[Block Issue][hip graph]: HIP error: operation not permitted when stream is capturing

I posted my current findings in https://github.com/ROCm/rccl/issues/2022. It looks like an RCCL problem for now, so let's keep tracking it there.

[Block Issue][hip graph]: HIP error: operation not permitted when stream is capturing

Update on the current debug status: the offending call in VLLM is actually not `NCCLInitComm` but a `hipEventQuery` during graph capture. This explains why the reproducer cannot duplicate but VLLM still...

[Block Issue][hip graph]: HIP error: operation not permitted when stream is capturing

Hi @zejunchen-zejun, could you please help test the following script on an Nvidia system? This script reproduces the problem on the AMD platform; I'd just like to make sure...

[Block Issue][hip graph]: HIP error: operation not permitted when stream is capturing

The Watchdog behavior on NV and AMD is exactly the same, so if the reproducer does not fail on NV, it is likely a HIP runtime problem we need...

[collective op][cuda graph] capture collective ops but got an HIP error: operation not permitted when stream is capturing

Hi @thananon, I inspected this issue further, and this line seems to be what breaks stream capture: https://github.com/ROCm/rccl/blob/62ab7a22d741ab4f214b6b185b77d030ba7bb85b/src/init.cc#L2442 Is this `hipFree` call expected here? Call stack from torch: Disassembling `ncclCommInitRankDev`...

[collective op][cuda graph] capture collective ops but got an HIP error: operation not permitted when stream is capturing

Hi @zejunchen-zejun, thanks for trying this out on an NV system. I dug deeper into PyTorch, and my initial guess is that this is expected. In the repro script you provided, the NCCL...

[Issue]: Memory access fault by GPU node-1 (Agent handle: ...) on address (nil). Reason: Page not present or supervisor privilege

Hi @da-phil, yes, I agree with @ianbmacdonald that you should let TTM manage the memory instead of configuring a carveout in the BIOS. It might sound counter-intuitive, but if you carve out a...