Theodor Badea
Hey, @JoongunPark , thanks for sharing this. As far as I can see, @briancoutinho did not align, i.e. increment, the external id, but added record function id to the p...
Hey, @JoongunPark . Yep, I think in order to have this fixed and also have a robust linkage, PyTorch needs to also dump the rf id for kernel nodes. I...
Hey, @sunboyZgz . Can you share the code you used to capture the traces? The profiler part. It would be interesting to see how you ended up with only CPU nodes.
@32HD can you please try https://github.com/mlcommons/chakra/pull/190 ? It may be related.
Can you check your Kineto trace to see whether it contains a cudaLaunchKernelExC event with the same correlation as your failing collective?
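A rough sketch of that check, in case it helps. It assumes the Kineto trace is the usual Chrome-trace JSON with a top-level `traceEvents` list and that each event stores its correlation under `args["correlation"]`; the helper names and the file path are hypothetical, so adjust them to your setup.

```python
import json


def load_trace(path):
    """Load a Kineto trace (assumed to be Chrome-trace JSON) from disk."""
    with open(path) as f:
        return json.load(f)


def find_events_by_correlation(trace, correlation, name_substring=None):
    """Return events whose args["correlation"] equals `correlation`.

    If `name_substring` is given, keep only events whose name contains it,
    e.g. "cudaLaunchKernelExC". The "correlation" key is an assumption
    about the trace layout, not something confirmed in this thread.
    """
    matches = []
    for event in trace.get("traceEvents", []):
        args = event.get("args") or {}
        if args.get("correlation") != correlation:
            continue
        if name_substring and name_substring not in event.get("name", ""):
            continue
        matches.append(event)
    return matches


# Tiny in-memory trace with made-up values, only to show the call shape.
example_trace = {
    "traceEvents": [
        {"name": "cudaLaunchKernelExC", "cat": "cuda_runtime",
         "args": {"correlation": 101}},
        {"name": "ncclKernel_example", "cat": "kernel",
         "args": {"correlation": 101}},
        {"name": "cudaLaunchKernel", "cat": "cuda_runtime",
         "args": {"correlation": 202}},
    ]
}

launches = find_events_by_correlation(
    example_trace, 101, name_substring="cudaLaunchKernelExC"
)
```

On a real trace you would call `load_trace("your_kineto_trace.json")` and pass the correlation of the failing collective; an empty result would suggest the launch event is missing from the trace.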