
NCCL:Broadcast collectives are missing from the converted trace but present in the trace_link

Open alexseceks opened this issue 1 year ago • 5 comments

Describe the Bug

After running a ResNet50 or TinyLlama2 workload on 4 ranks, I see at least one nccl:broadcast collective in the Kineto trace. The same collective appears in the trace_link file, but it is no longer present in the converted trace. Is this expected behavior, or is it an issue on the Chakra converter side?

I looked through the converter implementation but did not find anything indicating that broadcast collectives should be dropped. Is there something I missed?

Steps to Reproduce

Using the Chakra version from 6 Sept, after the merge of commit #140.

Expected Behavior

See the nccl:broadcast collective in the converted trace.

Screenshots

This is the trace_link file; the broadcast collective is present.

[Screenshot 2024-10-16 at 14 15 15]

This is the converted trace in JSON format; no broadcast collective can be found (the search result is at the bottom of the picture).

[Screenshot 2024-10-16 at 14 17 39]

alexseceks avatar Oct 16 '24 11:10 alexseceks

I tested the same steps with the latest version of Chakra, installed from the repository on 16 Oct; the behavior is the same.

alexseceks avatar Oct 16 '24 11:10 alexseceks

While looking into this issue, I observed that the nccl:broadcast operation is a CPU operation and therefore does not pass this check in pytorch_converter.py:

    def get_protobuf_node_type_from_json_node(
        self, json_node_map: Dict[int, PyTorchNode], json_node: PyTorchNode
    ) -> int:
        """
        Determine the Protobuf node type from a Chakra node.

        Args:
            json_node_map (Dict[int, PyTorchNode]): Dictionary of JSON nodes.
            json_node (PyTorchNode): The JSON node to determine the type of.

        Returns:
            int: The corresponding Chakra node type.
        """
        if json_node.is_gpu_op():
            if "ncclDevKernel_SendRecv" in json_node.name:
                parent_node = json_node_map[json_node.parent]
                keyword = (
                    json_node_map[parent_node.parent].name
                    if parent_node.name == "record_param_comms"
                    else parent_node.name
                )
                if "send" in keyword:
                    return COMM_SEND_NODE
                if "recv" in keyword:
                    return COMM_RECV_NODE
            if "ncclKernel" in json_node.name or "ncclDevKernel" in json_node.name:
                return COMM_COLL_NODE
        return COMP_NODE

Why is this collective not included in the converted trace? Are there any plans to include the broadcast operation in the trace in the future? Thanks!
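To make the observation concrete, here is a minimal, self-contained sketch of the collective branch of the check quoted above. `FakeNode`, `NodeType`, and the kernel name are hypothetical stand-ins, not the real `PyTorchNode` or Chakra protobuf constants, and the send/recv branch is omitted. It only illustrates that a CPU-side node named nccl:broadcast falls through to the computation type; it does not by itself explain why the node would be missing from the converted trace altogether.

```python
from dataclasses import dataclass
from enum import Enum


class NodeType(Enum):
    # Stand-in labels (assumption); the converter uses Chakra protobuf constants.
    COMP_NODE = "COMP_NODE"
    COMM_COLL_NODE = "COMM_COLL_NODE"


@dataclass
class FakeNode:
    """Hypothetical stand-in for PyTorchNode with only the fields used here."""

    name: str
    on_gpu: bool

    def is_gpu_op(self) -> bool:
        return self.on_gpu


def classify(node: FakeNode) -> NodeType:
    """Simplified copy of the quoted check (send/recv handling left out)."""
    if node.is_gpu_op():
        if "ncclKernel" in node.name or "ncclDevKernel" in node.name:
            return NodeType.COMM_COLL_NODE
    # Anything that is not a GPU op, or not an NCCL kernel, ends up here.
    return NodeType.COMP_NODE


# CPU-side operator as it appears in the trace.
cpu_broadcast = FakeNode(name="nccl:broadcast", on_gpu=False)
# Illustrative GPU kernel name (assumption, not taken from the trace).
gpu_kernel = FakeNode(name="ncclDevKernel_Broadcast_RING_LL", on_gpu=True)

print(classify(cpu_broadcast).value)
print(classify(gpu_kernel).value)
```

Under these assumptions, only a GPU kernel whose name contains ncclKernel or ncclDevKernel can be typed as a collective; the CPU-side nccl:broadcast operator never reaches that branch.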

alexseceks avatar Oct 17 '24 14:10 alexseceks

I'm seeing similar issues with nccl:gather, which seems to clear the if statement Alex showed above, but then fails to match any of the "known" collective operation strings in get_collective_comm_type().
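The kind of failure described here can be sketched as a name lookup against a table of known collectives. This is not the actual body of get_collective_comm_type(); the `KNOWN` table, the `lookup_collective` helper, and the string labels are hypothetical and only show how a name such as nccl:gather can miss every known entry while nccl:all_gather still matches.

```python
from typing import Dict, Optional

# Hypothetical table of recognised collectives; "gather" is deliberately absent.
KNOWN: Dict[str, str] = {
    "allreduce": "ALL_REDUCE",
    "allgather": "ALL_GATHER",
    "broadcast": "BROADCAST",
}


def lookup_collective(name: str, known: Dict[str, str]) -> Optional[str]:
    """Return the label of the first known key contained in the name, or None."""
    normalized = name.replace("_", "").replace("-", "").lower()
    # Try longer keys first so "allgather" is preferred over a shorter
    # key such as "gather" that is a substring of it.
    for key in sorted(known, key=len, reverse=True):
        if key in normalized:
            return known[key]
    return None


print(lookup_collective("nccl:all_gather", KNOWN))
print(lookup_collective("nccl:gather", KNOWN))

# Adding the missing key lets the name resolve in this sketch.
extended = {**KNOWN, "gather": "GATHER"}
print(lookup_collective("nccl:gather", extended))
```

One design point worth noting if substring matching is used: "gather" is contained in "allgather", so the lookup order matters once both keys exist. The sketch checks longer keys first for that reason.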


fh-TurbaAI avatar Oct 21 '24 12:10 fh-TurbaAI

https://github.com/mlcommons/chakra/pull/190

theodorbadea avatar Mar 27 '25 18:03 theodorbadea

@fh-TurbaAI can you have a look at the draft PR related to this issue at https://github.com/mlcommons/chakra/pull/190 ?

winstonliu1111 avatar Mar 28 '25 16:03 winstonliu1111