add collective dependencies as data_deps
Currently, there is no explicit dependency between collectives belonging to the same process group (PG). Their sequential execution can only be inferred from the duration of their parent compute node and their own duration.
This PR proposes appending the previous collective belonging to the same PG to each collective's data_deps list. I am attaching an example containing 5 All_Reduce operations in the same PG. Before the proposed change, each collective had only one node in its data_deps, namely the parent COMP_NODE named "record_param_comms". After the change, the first All_Reduce in the PG remains unchanged, while each subsequent one has 2 nodes in data_deps: the COMP_NODE "record_param_comms" and the previous collective communication node.
Also attaching part of a diagram showing the collectives both without explicit dependencies (they are all leaf nodes) and with the newly added dependencies. The nodes cropped out above the collectives are the parent COMP_NODE.
kineto.json
new_chakra_jsonized.json
original_chakra_jsonized.json
p_trace_rank=0_.json
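For reference, the linking step could look roughly like the sketch below. This is illustrative only; the real converter code differs, and the node attributes (`id`, `pg_name`, `data_deps`) are simplified placeholders.

```python
# Illustrative sketch of the proposed linking step (the real converter code
# differs; node attributes here are simplified placeholders).
def link_collectives_by_pg(nodes):
    """Append the previous collective of the same PG to each collective's data_deps.

    `nodes` is assumed to be in issue order; collective nodes are assumed to
    expose `id`, `pg_name`, and `data_deps` attributes.
    """
    last_comm_by_pg = {}  # pg_name -> id of the last collective seen in that PG
    for node in nodes:
        pg = getattr(node, "pg_name", None)
        if pg is None:  # not a collective communication node
            continue
        prev_id = last_comm_by_pg.get(pg)
        if prev_id is not None and prev_id not in node.data_deps:
            # The first collective in a PG keeps only its original deps (the
            # parent COMP_NODE); every later one also depends on its predecessor.
            node.data_deps.append(prev_id)
        last_comm_by_pg[pg] = node.id
    return nodes
```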
Hi @theodorbadea. Thank you for this PR. I think adding a data dependency between collectives that have the same PG name is a good feature to have. Let me take a look, run some tests, and get back to you.
Also, I need to understand the redundant check better.
Hey @JoongunPark, thanks for your feedback on this PR. That check is redundant because child nodes are only added to the stack if they are not in the visited list:
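A minimal reconstruction of that traversal pattern, for illustration only (not the exact converter code):

```python
# Illustrative reconstruction of the traversal pattern under discussion
# (not the exact converter code). Because children are checked against
# `visited` before being pushed, nothing already visited ever sits on the
# stack, so a second `if node.id in visited` check after popping is redundant.
def traverse(root):
    visited = {root.id}
    stack = [root]
    order = []
    while stack:
        node = stack.pop()
        order.append(node)
        for child in node.children:
            if child.id not in visited:  # children are only pushed when unvisited
                visited.add(child.id)
                stack.append(child)
    return order
```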
Ah, yes, you are right. I agree that the original logic seems a bit redundant and hard to understand. This PR looks good to me. I will be back soon after a few tests. Thank you!
Thank you for your patience, @theodorbadea! And I am sorry for the delayed testing. I have tested this PR with three different traces: Resnet, GPT3, and DeepSeek. The code terminates without error and the results run well on my ASTRA-Sim. This PR looks good to me. Could you please check and approve this PR, @tushar-krishna @srinivas212? Thank you!
Result logs from ASTRA-Sim:
GPT3
sys[0] finished, 4876814000 cycles, exposed communication 10158000 cycles.
num_cpu_ops: 78613
num_gpu_ops: 9813
num_gpu_comms: 603
tics_cpu_ops: 4617349000
tics_gpu_ops: 4866656000
tics_gpu_comms: 753759692
Collective: ALL_REDUCE, # ops: 9, Bytes: 36, runtime: 474444
Collective: ALL_GATHER, # ops: 392, Bytes: 19761348608, runtime: 343905160
Collective: BROADCAST, # ops: 2, Bytes: 12, runtime: 11000
Collective: REDUCE_SCATTER, # ops: 200, Bytes: 161521860608, runtime: 409369088
DeepSeek
sys[0] finished, 27726132000 cycles, exposed communication 25816064000 cycles.
num_cpu_ops: 404788
num_gpu_ops: 80431
num_gpu_comms: 1792
tics_cpu_ops: 27726131000
tics_gpu_ops: 1910068000
tics_gpu_comms: 915764527
Collective: ALL_REDUCE, # ops: 18, Bytes: 136, runtime: 948888
Collective: ALL_GATHER, # ops: 23, Bytes: 284717568, runtime: 5603986
Collective: ALL_TO_ALL, # ops: 1728, Bytes: 86973087744, runtime: 897208704
Collective: REDUCE_SCATTER, # ops: 23, Bytes: 4555481088, runtime: 12002949
Resnet
sys[0] finished, 33805413000 cycles, exposed communication 30153502000 cycles.
num_cpu_ops: 5294
num_gpu_ops: 5971
num_gpu_comms: 3
tics_cpu_ops: 33805413000
tics_gpu_ops: 3651911000
tics_gpu_comms: 710708
Collective: ALL_REDUCE, # ops: 1, Bytes: 102228128, runtime: 516708
Collective: BROADCAST, # ops: 2, Bytes: 212904, runtime: 194000
After discussion with the group at our last Chakra meeting, we concluded that this PR needs further discussion.
@theodorbadea @JoongunPark Hello everyone. Have you looked at how Chakra handles data dependencies? In Chakra, the convert_ctrl_dep_to_data_dep function only converts ctrl_dep into data_dep, so in the final ET file data_dep represents the caller/callee relationship between ops rather than the dependency relationship between data streams. For example, I trained a model with pp=4 across 4 ranks, using many isend and irecv operations. At the beginning, rank 1 must wait for the data produced by rank 0's computation to arrive before it can start its own computation. This data-stream dependency is therefore very important for simulating pipeline parallelism; without it, the four ranks would run simultaneously, which is obviously problematic. What do you think about this? Looking forward to your answers.
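To make the concern concrete, here is a hypothetical sketch of how cross-rank point-to-point dependencies could be derived by pairing isend/irecv operations. The (src, dst, tag) matching key, the node attributes, and the helper name are assumptions for illustration only; Chakra's converter does not currently build such dependencies.

```python
from collections import defaultdict

# Hypothetical sketch: pair point-to-point ops across ranks so that an irecv on
# the destination rank depends on the matching isend on the source rank.
# The (src, dst, tag) key and the node attributes are assumptions, not Chakra's
# current behavior.
def match_p2p_deps(traces):
    """traces: {rank: list of nodes in issue order}; returns (recv_id, send_id) pairs."""
    pending_sends = defaultdict(list)  # (src, dst, tag) -> queue of send node ids
    cross_rank_deps = []
    for rank, nodes in traces.items():
        for node in nodes:
            if node.kind == "isend":
                pending_sends[(rank, node.peer, node.tag)].append(node.id)
    for rank, nodes in traces.items():
        for node in nodes:
            if node.kind == "irecv":
                key = (node.peer, rank, node.tag)
                if pending_sends[key]:
                    cross_rank_deps.append((node.id, pending_sends[key].pop(0)))
    return cross_rank_deps
```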
I observed the traces collected during data-parallel distributed training. I found that when different processes belonging to the same communication group perform an allreduce, the recorded pg_name may not be equal, so it is not entirely accurate to infer the dependency based on pg_name. The following are the traces I collected during training with dp=2 and pp=2 using Megatron-LM on 4x A6000 GPUs.
kineto_trace_zbpp_gpt.0.json kineto_trace_zbpp_gpt.1.json kineto_trace_zbpp_gpt.2.json kineto_trace_zbpp_gpt.3.json
@9LLPPLL6 My understanding is that two collectives from the same PG cannot overlap, and in practice, at least in NCCL's case, they do not, because NCCL takes care of scheduling them. We encountered overlapping collectives in emulation, which is why I proposed this change: to explicitly capture these dependencies. Opinions have been shared that the engine should be responsible for this scheduling. In your traces, do you notice any overlaps, in terms of start and duration times, between collectives in the same PG or between collectives from different PGs?
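One way to check would be a small overlap scan over the attached kineto traces, along the lines of the sketch below. `traceEvents`, `ts`, and `dur` are standard kineto fields, but the args key holding the PG name is an assumption and may differ in your traces.

```python
import json
from collections import defaultdict

# Rough overlap scan over a kineto trace: group communication events by
# process group and report any consecutive pair whose [ts, ts + dur)
# intervals intersect. The "Process Group Name" args key is an assumption.
def find_pg_overlaps(path):
    with open(path) as f:
        events = json.load(f).get("traceEvents", [])
    by_pg = defaultdict(list)
    for e in events:
        pg = (e.get("args") or {}).get("Process Group Name")
        if pg is not None and "ts" in e and "dur" in e:
            by_pg[pg].append((e["ts"], e["ts"] + e["dur"], e.get("name")))
    overlaps = []
    for pg, intervals in by_pg.items():
        intervals.sort(key=lambda t: (t[0], t[1]))
        for (s1, e1, n1), (s2, e2, n2) in zip(intervals, intervals[1:]):
            if s2 < e1:  # next event starts before the previous one ends
                overlaps.append((pg, n1, n2))
    return overlaps
```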
Perhaps there was a deviation in my understanding. Thank you for your answer; I understand what you mean.
Closing the PR for now. The consensus was that we will keep the notion that one process group can issue only one collective at any given time. Let's reopen if we want to revisit.
