wrong linkage between GPU ops and their CPU parents
Introduced in: https://github.com/mlcommons/chakra/pull/185
Previously, linkage was done based on the Ev idx attribute of the nodes and on the timestamps. This PR introduced linkage based on the External id attribute, but this fails to properly link the collective communication nodes to their CPU parents.
For example, a ncclDevKernel... operation (which is a collective communication) should be linked to a nccl:... operation (which, as far as I can tell, is a runtime op), and nccl:... should in turn be linked to a record_param_comms operation. This was the behavior before the PR.
Now, a ncclDevKernel... is wrongly linked to a record_param_comms, so we basically end up with the nccl:... runtime operation (which is a COMP_NODE) being executed in parallel with the ncclDevKernel....
The cause of the wrong linkage is the usage of external id, as can be seen in the following example:
- ncclDevKernel_AllGather_RING_LL with External id 17
- nccl:_all_gather_base with External id 18
- record_param_comms with External id 17
The dependencies should be as follows: record_param_comms -> nccl:_all_gather_base -> ncclDevKernel_AllGather_RING_LL, but because the External ids do not match, ncclDevKernel_AllGather_RING_LL is linked to record_param_comms instead of nccl:_all_gather_base.
Furthermore, this is confirmed by analyzing the timestamps of each operation: the operation closest in time to ncclDevKernel_AllGather_RING_LL is nccl:_all_gather_base, not record_param_comms.
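To make the failure mode concrete, here is a minimal, self-contained sketch (not Chakra's actual linker code; node names come from the example above and the timestamps are made up) showing how External-id-only matching picks record_param_comms, while a closest-preceding-timestamp heuristic would pick nccl:_all_gather_base:

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    external_id: int
    ts: int  # start timestamp in us (illustrative values)

cpu_ops = [
    Op("record_param_comms", external_id=17, ts=100),
    Op("nccl:_all_gather_base", external_id=18, ts=105),
]
gpu_kernel = Op("ncclDevKernel_AllGather_RING_LL", external_id=17, ts=106)

# External-id-only matching (current behavior): pairs the kernel with
# record_param_comms because both carry External id 17.
by_external_id = next(op for op in cpu_ops if op.external_id == gpu_kernel.external_id)

# Timestamp-based matching (previous behavior): picks the closest CPU op that
# starts before the kernel, i.e. nccl:_all_gather_base, which yields the chain
# record_param_comms -> nccl:_all_gather_base -> ncclDevKernel_AllGather_RING_LL.
by_timestamp = max((op for op in cpu_ops if op.ts <= gpu_kernel.ts), key=lambda op: op.ts)

print(by_external_id.name)  # record_param_comms (wrong parent)
print(by_timestamp.name)    # nccl:_all_gather_base (intended parent)
```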
@JoongunPark, is there any particular reason for changing the Ev idx linkage to External id? Is Ev idx unreliable?
cc: @winstonliu1111 @danmih-ixia
Possibly related to https://github.com/mlcommons/chakra/issues/186
> The dependencies should be as follows: record_param_comms -> nccl:_all_gather_base -> ncclDevKernel_AllGather_RING_LL, but because the External ids do not match, ncclDevKernel_AllGather_RING_LL is linked to record_param_comms instead of nccl:_all_gather_base.
Hi @theodorbadea, you pointed out the right one. That dependency structure was intended.
I updated ev_idx to External ID+rf_id after a few discussions with Brian@Meta.
At that time, we had observed mismatches between CPU and GPU ops.
From what I had observed, ev_idx was unreliable for matching the operations, at least for the traces that I have.
So Brian updated PyTorch (issue: https://github.com/pytorch/pytorch/issues/122833) to increase External ID+rf_id in a synchronized manner so that we can match the ops.
My PR (item 4) is here: https://github.com/mlcommons/chakra/pull/185
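If I understand the intent of that change correctly, the matching it enables looks roughly like the following (a hedged sketch with made-up values; I am assuming the usual Kineto arg names and a PyTorch-side node carrying rf_id, so your traces may differ):

```python
# Kineto cpu_op event (simplified): carries both "External id" and
# "Record function id" in its args.
kineto_cpu_op = {
    "name": "nccl:_all_gather_base",
    "cat": "cpu_op",
    "args": {"External id": 18, "Record function id": 42},
}

# Corresponding node on the PyTorch side, identified by rf_id.
et_node = {"name": "nccl:_all_gather_base", "rf_id": 42}

# With the synchronized IDs, the two can be paired deterministically
# instead of relying on ev_idx.
assert kineto_cpu_op["args"]["Record function id"] == et_node["rf_id"]
```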
Hey @JoongunPark, thanks for sharing this. As far as I can see, @briancoutinho did not align, i.e. increment, the external id, but instead added the record function id to the p trace.
The issue I still notice is that ncclDevKernel nodes from kineto do not have a record function id which could be matched to what Brian added in the p trace; only cpu_ops from kineto have this record function id. My first guess is that what Brian did for the p trace ops must also be done for the kernel ops inside the kineto trace. I did not have enough time to dig deeper into this, but at first glance this is what I suspect.
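For reference, this is roughly what I mean (simplified, illustrative events; the values are made up): the kernel event only carries an External id, while the cpu_op also carries a Record function id:

```python
kineto_kernel = {
    "name": "ncclDevKernel_AllGather_RING_LL",
    "cat": "kernel",
    "args": {"External id": 17},  # no "Record function id" on kernel events
}
kineto_cpu_op = {
    "name": "record_param_comms",
    "cat": "cpu_op",
    "args": {"External id": 17, "Record function id": 41},  # rf_id only on cpu_ops
}
```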
Hi @theodorbadea,
In your example, we match these two:
- ncclDevKernel_AllGather_RING_LL with External id 17
- record_param_comms with External id 17
This means we essentially ignored nccl:_all_gather_base with External id 18.
This was intended, and it was the best we could do at that point.
Leaving aside kernels like "nccl:_all_gather_base", I have checked my traces, and the external id and rf_id do seem to match in them.
From my understanding, if we need the External id to be aligned across all of these ops, we should be able to fix it on the PyTorch side. Do I understand correctly? Thank you!
Hey @JoongunPark. Yep, I think that in order to fix this and also have robust linkage, PyTorch needs to also dump the rf id for kernel nodes. I understand that this is currently the best approach; however, the resulting Chakra trace does not accurately reflect the workload. This is what I am getting now:
This is how I think it should look:
The question remains whether it's possible to add the record function id to ncclDevKernel nodes and, if so, to confirm that the record function id of the ncclDevKernel node matches the one of the nccl:... node.
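If PyTorch were extended that way, the confirmation step could be as simple as something like this (purely hypothetical, since kernel events do not carry a record function id today):

```python
def kernel_matches_launcher(kernel_args: dict, launcher_args: dict) -> bool:
    """True if the (hypothetical) rf_id on the kernel event matches the launcher's."""
    return (
        "Record function id" in kernel_args
        and kernel_args["Record function id"] == launcher_args.get("Record function id")
    )
```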
Can you share some example nodes from your kineto which show what you mean by "they have matched external id and rf_id in my traces"? I'm not sure I understand this. In my case, this is what I get in my kineto:
Hi @theodorbadea,
I think the new structure makes sense, and you're right—that’s the ideal graph.
Could you take a look at the trace I shared in #185?
That said, there are a few components I'd like to discuss:
In the trace, we can see that ncclDevKernel_AllGather_RING_LL and nccl:_all_gather_base are overlapping. So we either need to select one of them (or set the runtime of one to zero), or replace the runtime of the node with the difference:
nccl:_all_gather_base - ncclDevKernel_AllGather_RING_LL,
similar to how Chakra handles overlapping CPU ops.
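Here is a rough sketch of that second option (illustrative only, not Chakra's actual implementation; the durations are made up):

```python
def adjusted_runtime_us(launcher_dur_us: float, kernel_dur_us: float) -> float:
    """Duration left for the launcher (e.g. nccl:_all_gather_base) after
    subtracting the overlapping device kernel (e.g. ncclDevKernel_AllGather_RING_LL)."""
    return max(launcher_dur_us - kernel_dur_us, 0.0)

# Example: a 120 us nccl:_all_gather_base whose kernel occupies 100 us
print(adjusted_runtime_us(120.0, 100.0))  # 20.0
```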
The mismatch between rf_id and external_id might be caused by the trace collection logic, particularly the warmup or wait phases.
I’m not sure if this issue can be fully addressed on the Chakra side. If it has to be, what do you think would be a viable solution?