preserve collectives

theodorbadea opened this issue 9 months ago • 8 comments

DRAFT PR: discussion initiation, mostly addressing https://github.com/mlcommons/chakra/issues/161. Issue: some collectives are missing from the .et file but are present in the Kineto trace file.

It all starts from the host trace: these missing collectives have host operations whose parents are missing. To be more specific, the host ops corresponding to these collectives have a parent which is not part of the p_trace file. Specific example:

host_op1: { id: 439, name: "record_param_comms", ctrl_deps: 438, rf_id: 115 } -> this corresponds to a ncclDevKernel_Broadcast
host_op2: { id: 438, name: "c10d::broadcast_", ctrl_deps: 9 }
host_op3: { id: 9, name: "DistributedDataParallel.forward", ctrl_deps: 4 }

So host_op1 has host_op2 as its parent, host_op2 has host_op3 as its parent, and host_op3 has the node with id 4 as its parent. However, the host op with id 4 is not present in the p_trace file, so facebookresearch/param/et_replay will not properly build the parent/child links (as seen in execution_trace.py):

```python
if n.parent_id in self.nodes:  # here there is no node with id equal to parent_id in the list of nodes
    self.nodes[n.parent_id].add_child(n)
    n.set_parent(self.nodes[n.parent_id])
```

When Chakra loads the host trace, it does a traversal starting from the root node, but since some nodes have parents that are not present in the file, the trace is a graph with multiple connected components, and Chakra loads only the "main" component, the one containing the root node. For example, for the ResNet50 run on 4 ranks that I used for debugging, it loads 2410 of the 4484 host op nodes.
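To make the effect concrete, here is a minimal, self-contained sketch (not the actual et_replay code) of the linking pass above, extended to report the nodes whose parent id is absent instead of silently skipping them. Only the fields from the example above are used; everything else is an assumption for illustration.

```python
# Hedged sketch: mimics the parent/child linking in execution_trace.py and
# collects "orphans" whose parent id is missing from the trace.
class Node:
    def __init__(self, node_id, name, parent_id):
        self.id, self.name, self.parent_id = node_id, name, parent_id
        self.parent, self.children = None, []

    def add_child(self, child):
        self.children.append(child)

    def set_parent(self, parent):
        self.parent = parent


def link_nodes(nodes):
    nodes_by_id = {n.id: n for n in nodes}
    orphans = []
    for n in nodes:
        parent = nodes_by_id.get(n.parent_id)
        if parent is not None:
            parent.add_child(n)
            n.set_parent(parent)
        else:
            # The parent (id 4 in the example) is not in the p_trace file,
            # so this subtree becomes a disconnected component and is later
            # dropped by the root-node traversal.
            orphans.append(n)
    return orphans


# Reproducing the example: id 4 is absent, so the chain 439 -> 438 -> 9 is
# never reachable from the root.
ops = [Node(439, "record_param_comms", 438),
       Node(438, "c10d::broadcast_", 9),
       Node(9, "DistributedDataParallel.forward", 4)]
print([n.id for n in link_nodes(ops)])  # [9] -- and 438/439 hang off it
```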

Attaching several relevant files generated with and without the changes from this branch: p_trace_rank0.json, kineto_trace_rank0.json, new_jsonized_chakra_trace.json, new_merged.json, old_jsonized_chakra_trace.json, old_merged.json.

cc: @winstonliu1111 @danmih-ixia @TaekyungHeo @tushar-krishna

theodorbadea avatar Mar 27 '25 18:03 theodorbadea

Hey @theodorbadea,

Just wanted to confirm whether this PR is done or not, since it says "Draft PR" in the first line. If this is still a draft, could you mark it as a draft instead?

spandoescode avatar Apr 14 '25 21:04 spandoescode

Hey @spandoescode, I put "draft" in the first comment because this PR contains just some commented-out code, and for now I would like to use it only to support my proposal and gather feedback from the community. If we decide that this change is correct and should be integrated, I will make the proper code changes (a more elegant way of loading the host nodes, and entirely removing the "remove_dangling_nodes" method). However, I would like to avoid marking it as a Draft because I am not actively working on it and I want to draw some attention (I feel that marking this as a Draft might make the community ignore it, as people might believe it's still work in progress).

Looking forward to receiving your feedback on this proposal to load all the host nodes and not just the ones from the main connected component.

theodorbadea avatar Apr 14 '25 22:04 theodorbadea

@tushar-krishna, @srinivas212 can you please advise who can review this PR?

danmih-ixia avatar Apr 24 '25 13:04 danmih-ixia

Thank you for the suggestion! I think this fix will properly include the broadcast in the Chakra trace. However, I'm not sure it correctly represents the dependency behavior. For example, in new_jsonized_chakra_trace.json, the broadcasts with ids 116/117 and 175/176 are dependent on id 5, but id 5 is not dependent on anything. I would assume that in reality each of the broadcasts would depend on some compute.
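For reference, a quick check along these lines could look like the sketch below. It assumes the jsonized Chakra trace is a JSON file with a top-level "nodes" list whose entries carry "id", "name", and "data_deps"/"ctrl_deps" lists; these field names are assumptions, not the confirmed schema.

```python
# Hedged sketch: for each broadcast node, report direct dependencies that
# themselves depend on nothing (field names are assumed, adjust to the schema).
import json

with open("new_jsonized_chakra_trace.json") as f:
    trace = json.load(f)

nodes = {n["id"]: n for n in trace.get("nodes", [])}

def deps_of(node):
    return node.get("data_deps", []) + node.get("ctrl_deps", [])

for node in nodes.values():
    if "broadcast" in node.get("name", "").lower():
        for dep_id in deps_of(node):
            dep = nodes.get(dep_id)
            if dep is not None and not deps_of(dep):
                # e.g. broadcasts 116/117 and 175/176 depend on id 5, while
                # id 5 has no dependencies of its own.
                print(f"broadcast {node['id']} -> {dep_id} (no further deps)")
```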

What do you think?

jinsun-yoo avatar Apr 24 '25 14:04 jinsun-yoo

Hey, @jinsun-yoo. Indeed, since the parent is missing, it will not depend on any compute node. This is basically the reason why I initiated this discussion: to get feedback from the community about the correctness of this proposal...

theodorbadea avatar Apr 25 '25 11:04 theodorbadea

Hi @theodorbadea, thank you for the PR, and sorry for joining this discussion late. Originally, when Chakra loads host ops, it traverses the operations; as @theodorbadea explained, this eliminates lots of host ops that are not connected to each other. I believe this is an important problem to solve. This PR basically preserves all the CPU ops in the host traces and the dangling Chakra nodes in the Chakra trace.

I have a few suggestions (a rough sketch of (1) follows below). (1) We need extra logic that searches for a dependency for the broadcast and connects compute -> broadcast (in this case) in a better way. (2) We might still need remove_dangling_nodes rather than skipping it; otherwise, all the dangling nodes will be scheduled at the beginning of the simulation/replay. (3) We need to understand why some host nodes do not have dependencies on other nodes. That would help us solve this problem.
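One possible shape for suggestion (1), as a hedged sketch rather than the converter's real API: attach each dangling communication node to the latest compute node that finishes before it starts, based on timestamps. The ChakraNode fields used here (ts, dur, is_comm, data_deps) are assumptions for illustration.

```python
# Hedged sketch: connect compute -> dangling collective using timestamps.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChakraNode:
    id: int
    name: str
    ts: int                      # start time (us)
    dur: int                     # duration (us)
    is_comm: bool                # True for collectives such as broadcast
    data_deps: List[int] = field(default_factory=list)

def attach_dangling_comms(nodes: List[ChakraNode]) -> None:
    compute = sorted((n for n in nodes if not n.is_comm), key=lambda n: n.ts)
    for comm in (n for n in nodes if n.is_comm and not n.data_deps):
        # Latest compute node that ends before this collective starts.
        preceding = [c for c in compute if c.ts + c.dur <= comm.ts]
        if preceding:
            comm.data_deps.append(preceding[-1].id)
```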

JoongunPark avatar May 15 '25 22:05 JoongunPark

Hi @theodorbadea, thank you for the talk today. We weren't able to discuss this PR much, but I believe it addresses a very important problem that must be fixed. I hope we can discuss it in detail in our next meeting. Thank you!

JoongunPark avatar May 19 '25 19:05 JoongunPark

@jinsun-yoo I managed to reproduce the situation you were targeting with your comment when training AlexNet with FSDP. This is what the original .et looks like (see attached image).

This is what the .et containing the orphan collectives looks like (see attached image).

Considering the two illustrations, and analyzing AlexNet under FSDP with size_based_auto_wrap_policy and a 1,500,000-parameter minimum for wrapping (a rough sketch of this setup follows below), it seems that the entire forward-pass part and the final AllReduce (produced by explicitly calling all_reduce(loss, op=dist.ReduceOp.SUM)) are missing. I believe that 3 modules of the AlexNet (the 3 layers from the classifier) and perhaps their nn.Sequential container get wrapped, which explains the 4 AllGathers and 4 ReduceScatters visible in the original .et. However, the AllGathers corresponding to the forward pass are entirely missing. Additionally, your remark was correct: forcing Chakra to load these orphan nodes is not enough, because dependencies are also required. In the second illustration, the AllReduce has no dependency, so it may get executed first, when in fact it should be executed last.
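For context, this is roughly the setup described above, as a hedged sketch rather than the exact training script. It assumes the process group has already been initialized; the placeholder loss stands in for the real training loop.

```python
# Hedged sketch: AlexNet wrapped with FSDP using size_based_auto_wrap_policy
# (1,500,000-parameter threshold) plus the explicit loss all-reduce that
# produces the final AllReduce seen in the trace.
import functools
import torch
import torch.distributed as dist
import torchvision
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Assumes dist.init_process_group(...) has already been called.
model = torchvision.models.alexnet()
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_500_000)
model = FSDP(model, auto_wrap_policy=wrap_policy)  # the classifier layers get wrapped

# ... forward/backward would go here ...
loss = torch.tensor(0.0)                     # placeholder for the real loss
dist.all_reduce(loss, op=dist.ReduceOp.SUM)  # the final AllReduce in question
```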

cc: @JoongunPark k_trace_rank=0_.json p_trace_rank=0_.json

Later edit: @TaekyungHeo (I think you were interested in seeing an example), please find the AlexNet traces attached. In the p_trace file, the node with "id": 4 is missing; however, 4 appears as the parent of several ops, and consequently these ops will be dropped.

theodorbadea avatar May 21 '25 13:05 theodorbadea

Hi @theodorbadea. I realized that this PR, together with your next PR #191, might work better than the current version. As we weren't able to finalize our discussion in the previous meeting, it would be great to move forward after our next Chakra meeting.

Also, I have a quick question. Were you able to observe missing operations other than collectives? If yes, do you have any idea how to enforce dependencies for those ops? Thank you!

JoongunPark avatar May 23 '25 20:05 JoongunPark

Hey, @JoongunPark. After some more digging, I found out that what's captured in the resulting chakra.et corresponds to thread tid=2, and what's missing corresponds to thread tid=1. About your question: yes, there are also missing compute nodes. So, basically, in the attached example, what we can see in the .et right now corresponds to:

{ "id": 345, "name": "[pytorch|profiler|execution_trace|thread]", "ctrl_deps": 1, ... "tid": 2 }

and what is missing corresponds to:

{ "id": 2, "name": "[pytorch|profiler|execution_trace|thread]", "ctrl_deps": 1, ... "tid": 1 }

I noticed that the node with "id": 4 is missing from the p_trace, therefore all the nodes which have "ctrl_deps": 4 will be dropped. Then I did a little experiment and manually replaced "ctrl_deps": 4 with "ctrl_deps": 2, basically "forcing" the orphan graph components to be part of the graph by setting {"id": 2, "name": "[pytorch|profiler|execution_trace|thread]"} as their parent (which I suspect it truly is, but there are some missing nodes in between).
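For reference, that manual experiment can be expressed as a small patch script; this is a hedged sketch that assumes the PyTorch execution trace JSON has a top-level "nodes" list (the output file name is made up for illustration).

```python
# Hedged sketch: reattach the orphan components under the tid=1 thread node
# by rewriting the missing parent id 4 to id 2 in the p_trace file.
import json

with open("p_trace_rank=0_.json") as f:
    trace = json.load(f)

for n in trace["nodes"]:
    if n.get("ctrl_deps") == 4:   # parent id 4 is missing from the trace
        n["ctrl_deps"] = 2        # the tid=1 [pytorch|profiler|execution_trace|thread] node

with open("p_trace_rank=0_patched.json", "w") as f:  # hypothetical output name
    json.dump(trace, f)
```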

Now I intend to check the PyTorch profiler and see if something is missing there, perhaps some setting, which might explain the missing host operations. I find it very strange that it is able to record {"id": 2, "name": "[pytorch|profiler|execution_trace|thread]", "ctrl_deps": 1} and some other distant children of this node, but fails to record what's in between. Attaching a new mermaid chart generated after hardcoding the proper parent (see attached image).

theodorbadea avatar May 27 '25 18:05 theodorbadea

Thank you for the confirmation, @theodorbadea! I think your approach makes sense. We may need to choose between (1) finding hidden dependencies we might be missing or (2) forcing ops to have dependencies, like you did. Let's discuss this during the meeting.

JoongunPark avatar Jun 02 '25 18:06 JoongunPark