zejunchen-zejun

Results 16 comments of zejunchen-zejun

Here are the logs I get with the log env flags. How can I get more info? ``` 51884395::3:hip_event.cpp :487 : 547211151990 us: ihipEventQuery: Returned hipErrorStreamCaptureUnsupported : 51884402::3:hip_event.cpp :494 :...

Hi, @amd-nicknick @ppanchad-amd Could you help take a look, or do you know who might be familiar with HIP graph issues? Thank you.

I dug into the issue and forwarded it to RCCL: https://github.com/ROCm/rccl/issues/2022

Hi, @satyanveshd @amd-nicknick Let's track the issue here. RCCL is ok for now I think. On our application side, the code below is calling the torch.dist.all_reduce, which is totally and...

Hi, @amd-nicknick Thank you for your help. We have verified the env flag and it works! With this flag set, hipEventQuery is not called in the torch.dist op. Our application...

Hi, @amd-nicknick Thank you for your help! First of all, we will run your reproducer on B200 with world sizes from 1 to 8 to check the B200's behavior. We will check...

Hi, @amd-nicknick @satyanveshd Sorry for the late response. I verified the reproducer on the B200 machine and here is the log: ``` root@GPUA81E:/home/zejchen/graph_issue# python -u reproducer.py /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml...

Hi, @iassiour @amd-nicknick Thank you so much for the fix. May I know whether the fix passes the reproducer here? BTW, how long until we can get your fix?...

Thank you, @iassiour. Once it is included in a ROCm release, we will verify your fix and remove the workaround at our application level. Thank you for your help!

I hit the same issue when building CUTLASS on B200. Let me verify the potential fix.