
wrong linkage between GPU ops and their CPU parents

Open · theodorbadea opened this issue 6 months ago • 6 comments

Introduced in: https://github.com/mlcommons/chakra/pull/185

Previously, linkage was done based on the Ev idx attribute of the nodes and on the timestamps. This PR introduced linkage based on the External id attribute, but this fails to properly link the collective communication nodes to their CPU parents.

For example, a ncclDevKernel... operation (which is a collective communication) should be linked to a nccl:... operation (which, as far as I can tell, is a runtime op), and nccl:... should be linked to a record_param_comms operation. This was the behavior before the PR.

Now, a ncclDevKernel... is wrongly linked to a record_param_comms, so we basically end up with the nccl:... runtime operation (which is a COMP_NODE) being executed in parallel with the ncclDevKernel....

The cause of the wrong linkage is the use of External id, as can be seen in the following example:

ncclDevKernel_AllGather_RING_LL with External id 17
nccl:_all_gather_base with External id 18
record_param_comms with External id 17

The dependencies should be as follows: record_param_comms -> nccl:_all_gather_base -> ncclDevKernel_AllGather_RING_LL, but because the External id does not match, ncclDevKernel_AllGather_RING_LL is linked to record_param_comms instead of nccl:_all_gather_base.

Furthermore, the timestamps of each operation confirm this: the operation closest in time to ncclDevKernel_AllGather_RING_LL is nccl:_all_gather_base, not record_param_comms.
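For a concrete, minimal illustration of the two linking strategies, here is a small sketch; the event dicts and timestamps below are hypothetical, not taken from an actual trace:

```python
# Hypothetical events mimicking the three ops above; only the fields relevant
# to the linkage are included, and the timestamps are made up.
events = [
    {"name": "record_param_comms",              "cat": "cpu_op", "ts": 100, "External id": 17},
    {"name": "nccl:_all_gather_base",           "cat": "cpu_op", "ts": 105, "External id": 18},
    {"name": "ncclDevKernel_AllGather_RING_LL", "cat": "kernel", "ts": 106, "External id": 17},
]

kernel = events[2]
cpu_ops = [e for e in events if e["cat"] == "cpu_op"]

# Linking by External id pairs the kernel with record_param_comms.
by_external_id = next(op for op in cpu_ops if op["External id"] == kernel["External id"])

# Linking by the closest preceding CPU op in time pairs it with nccl:_all_gather_base.
by_timestamp = max((op for op in cpu_ops if op["ts"] <= kernel["ts"]), key=lambda op: op["ts"])

print(by_external_id["name"])  # record_param_comms
print(by_timestamp["name"])    # nccl:_all_gather_base
```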

@JoongunPark, any particular reason for changing the Ev idx linkage to External id? Is Ev idx unreliable?

cc: @winstonliu1111 @danmih-ixia

theodorbadea · Jun 13 '25 12:06

Possibly related to https://github.com/mlcommons/chakra/issues/186.

theodorbadea · Jun 13 '25 12:06

> The dependencies should be as follows: record_param_comms -> nccl:_all_gather_base -> ncclDevKernel_AllGather_RING_LL, but because the External id does not match, ncclDevKernel_AllGather_RING_LL is linked to record_param_comms instead of nccl:_all_gather_base.

Hi @theodorbadea, you pointed out the right one. That dependency structure was intended.

I updated ev_idx to External ID+rf_id after a few discussions with Brian@Meta. At that time we had observed mismatches between CPU and GPU ops, and from what I could see, ev_idx was unreliable for matching the operations, at least for the traces that I have.

So, Brian updated PyTorch (issue: https://github.com/pytorch/pytorch/issues/122833) to increase External ID+rf_id in a synchronized manner so that we can match the ops.

My PR (item 4) is here: https://github.com/mlcommons/chakra/pull/185
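For illustration only (this is not the actual #185 implementation, and all field names are placeholders), matching on a synchronized (External id, rf_id) pair could look roughly like this:

```python
# Illustrative sketch, not the actual chakra trace_link code: join ops from the
# PyTorch trace to Kineto CPU ops on a synchronized (external_id, rf_id) key.
def link_by_external_and_rf_id(pytorch_ops, kineto_ops):
    index = {
        (op["external_id"], op["rf_id"]): op
        for op in kineto_ops
        if "external_id" in op and "rf_id" in op
    }
    links = {}
    for op in pytorch_ops:
        key = (op.get("external_id"), op.get("rf_id"))
        if key in index:
            links[op["name"]] = index[key]["name"]
    return links
```

If both sides emit the pair consistently, the join is unambiguous; the mismatches discussed in this thread arise when one side does not carry, or does not align, part of the key.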

JoongunPark · Jun 16 '25 18:06

Hey @JoongunPark, thanks for sharing this. As far as I can see, @briancoutinho did not align (i.e. increment) the external id, but instead added the record function id to the p trace.

The issue I still notice is that the ncclDevKernel nodes from kineto do not have a record function id that can be matched against what Brian added in the p trace; only the cpu_ops from kineto have this record function id. My first guess is that what Brian did for the p trace ops must also be done for the kernel ops inside the kineto trace. I did not have enough time to dig deeper into this, but at first glance this is what I suspect.
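One quick way to check this on a Kineto JSON trace is to count, per event category, which events carry a "Record function id" in their args and which carry only an "External id". This is just a diagnostic sketch; the file name is a placeholder and the arg names are assumed to follow the usual Kineto layout:

```python
import json
from collections import defaultdict

# Diagnostic sketch: per event category, count events whose args carry a
# "Record function id" vs. only an "External id".
with open("kineto_trace.json") as f:  # placeholder path
    trace = json.load(f)

counts = defaultdict(lambda: {"rf_id": 0, "ext_id_only": 0, "neither": 0})
for ev in trace.get("traceEvents", []):
    args = ev.get("args") or {}
    cat = ev.get("cat", "unknown")
    if "Record function id" in args:
        counts[cat]["rf_id"] += 1
    elif "External id" in args:
        counts[cat]["ext_id_only"] += 1
    else:
        counts[cat]["neither"] += 1

for cat, c in sorted(counts.items()):
    print(cat, c)
```

If the kernel category shows only external ids, that matches the observation above that the cpu_ops are the only events carrying a record function id.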

theodorbadea · Jun 17 '25 15:06

Hi @theodorbadea .

In your example, we match these two:
ncclDevKernel_AllGather_RING_LL with External id 17
record_param_comms with External id 17

This means we essentially ignored nccl:_all_gather_base with External id 18.

This was intended and was our best option at that point. Leaving aside kernels like "nccl:_all_gather_base", I have checked my traces and the external id and rf_id do seem to match in them.

From my understanding, if we need to align the External id across all of these ops, we should be able to fix it on the PyTorch side. Do I understand correctly? Thank you!

JoongunPark · Jun 26 '25 00:06

Hey @JoongunPark. Yep, I think that in order to fix this and also have robust linkage, PyTorch needs to also dump the rf id for kernel nodes. I understand that this is currently the best approach; however, the resulting Chakra trace does not accurately reflect the workload. This is what I am getting now:

Image

This is how I think it should look:

Image

The question remains whether it is possible to add the record function id to ncclDevKernel nodes and, if so, whether the record function id of a ncclDevKernel node actually matches that of the corresponding nccl: node.

Can you share some example nodes from your kineto trace which show what you mean by "they have matched external id and rf_id in my traces"? I'm not sure I understand this. In my case, this is what I get in my kineto trace:

Image

Image

theodorbadea · Jun 26 '25 12:06

Hi @theodorbadea,

I think the new structure makes sense, and you're right—that’s the ideal graph.

Could you take a look at the trace I shared in #185?

That said, there are a few components I'd like to discuss:

In the trace, we can see that ncclDevKernel_AllGather_RING_LL and nccl:_all_gather_base overlap. So we either need to select one of them (or set the runtime of one to zero), or substitute the runtime of the node with the difference nccl:_all_gather_base - ncclDevKernel_AllGather_RING_LL, similar to how Chakra handles overlapping CPU ops (see the small numeric sketch after these points).

The mismatch between rf_id and external_id might be caused by the trace collection logic—particularly from warmup or wait.

I’m not sure if this issue can be fully addressed on the Chakra side. If it has to be, what do you think would be a viable solution?
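Returning to the first point, here is a tiny numeric sketch of the duration-difference option; the durations are made up and purely illustrative:

```python
# Made-up durations in microseconds, purely illustrative.
dur_nccl_runtime = 40  # nccl:_all_gather_base (CPU runtime op)
dur_nccl_kernel = 32   # ncclDevKernel_AllGather_RING_LL (GPU kernel)

# Option sketched above: keep both nodes, but charge the runtime op only for
# the portion not covered by the kernel, similar to how overlapping CPU ops
# are handled.
adjusted_runtime = max(dur_nccl_runtime - dur_nccl_kernel, 0)
print(adjusted_runtime)  # 8
```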

JoongunPark · Jun 30 '25 18:06