
Communication Nodes Incorrectly Marked as COMP_NODE

Open lurw2000 opened this issue 1 year ago • 1 comments

I followed the user guide and converted a PyTorch ET + Kineto trace into a Chakra ET. However, communication nodes appear to be incorrectly marked as computation nodes. For example:

{
  "id": "20",
  "name": "nccl:broadcast",
  "type": "COMP_NODE",
  "ctrlDeps": [
    "19"
  ],
  "dataDeps": [
    "19"
  ],
  "inputs": {
    "values": "[[21, 22, 0, 5, 8, 'cuda:7']]",
    "shapes": "[[5]]",
    "types": "['Tensor(long int)']"
  },
  "outputs": {
    "values": "[]",
    "shapes": "[]",
    "types": "[]"
  },
  "attr": [
    {
      "name": "rf_id",
      "int64Val": "19"
    },
    {
      "name": "fw_parent",
      "int64Val": "0"
    },
    {
      "name": "seq_id",
      "int64Val": "-1"
    },
    {
      "name": "scope",
      "int64Val": "7"
    },
    {
      "name": "tid",
      "int64Val": "1"
    },
    {
      "name": "fw_tid",
      "int64Val": "0"
    },
    {
      "name": "op_schema",
      "stringVal": ""
    },
    {
      "name": "is_cpu_op",
      "boolVal": true
    },
    {
      "name": "stream",
      "int64Val": "0"
    }
  ]
}

At https://github.com/mlcommons/chakra/blob/main/src/converter/pytorch_converter.py#L341

if json_node.is_gpu_op():
    if "ncclDevKernel_SendRecv" in json_node.name:
        parent_node = json_node_map[json_node.parent]
        keyword = (
            json_node_map[parent_node.parent].name
            if parent_node.name == "record_param_comms"
            else parent_node.name
        )
        if "send" in keyword:
            return COMM_SEND_NODE
        if "recv" in keyword:
            return COMM_RECV_NODE
    if "ncclKernel" in json_node.name or "ncclDevKernel" in json_node.name:
        return COMM_COLL_NODE
    return COMP_NODE

It seems that a node must first be recognized as a GPU node before it can be classified as a communication node. However, the logic of json_node.is_gpu_op() is confusing. At https://github.com/mlcommons/chakra/blob/main/src/converter/pytorch_node.py#L149

def is_gpu_op(self) -> bool:
    """
    Check if the node is a GPU operator.
    
    Returns
        bool: True if the node is a GPU operator, False otherwise.
    """
    return self.cat is not None

However, it seems that the "cat" attribute is dropped when the PyTorch ET and the Kineto trace are linked, so no node is recognized as a GPU node and every node is consequently marked as COMP_NODE.

I do not know which part is behaving unexpectedly, but the logic of json_node.is_gpu_op() seems odd to me.
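To make the fall-through concrete, here is a minimal sketch of the behavior described above. FakeNode and classify are hypothetical stand-ins written for illustration, not the converter's real classes; the send/recv branch is omitted, and the node type constants are plain strings rather than the real enum values.

```python
from dataclasses import dataclass
from typing import Optional

# Plain-string placeholders for the node types (assumption: the real
# converter uses enum values from the Chakra protobuf schema).
COMP_NODE = "COMP_NODE"
COMM_COLL_NODE = "COMM_COLL_NODE"


@dataclass
class FakeNode:
    """Hypothetical stand-in for the converter's node class."""

    name: str
    cat: Optional[str] = None

    def is_gpu_op(self) -> bool:
        # Mirrors the quoted check: a node is a GPU op only if `cat` is set.
        return self.cat is not None


def classify(node: FakeNode) -> str:
    """Simplified copy of the quoted branch (send/recv handling omitted)."""
    if node.is_gpu_op():
        if "ncclKernel" in node.name or "ncclDevKernel" in node.name:
            return COMM_COLL_NODE
    return COMP_NODE


# An NCCL kernel whose `cat` survived linking is seen as a collective.
with_cat = classify(FakeNode("ncclDevKernel_Broadcast", cat="kernel"))
# The same kernel without `cat` silently falls through to COMP_NODE.
without_cat = classify(FakeNode("ncclDevKernel_Broadcast"))
# The CPU-side op from the example above never matches the kernel names.
cpu_side = classify(FakeNode("nccl:broadcast"))
```

In this sketch, a CPU-side op named nccl:broadcast is classified as COMP_NODE whether or not cat is set, because only kernel names containing ncclKernel or ncclDevKernel are matched. If the real converter behaves the same way, the communication node would be expected on the child GPU kernel, and a missing cat on that kernel is what would hide it.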

lurw2000 avatar Dec 06 '24 04:12 lurw2000

Hey @lurw2000, does this issue still persist? I opened a recent linked trace (ET + PyTorch) and can confirm that the cat string is carried over.

Image
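One way to check this on a given linked trace is to count how many nodes carry a cat value. The sketch below is only an assumption about the file layout: it presumes the linked trace is JSON with a top-level "nodes" list whose entries may contain a "cat" key. count_cat_values and load_and_count are hypothetical helper names, not part of Chakra.

```python
import json
from collections import Counter


def count_cat_values(trace: dict) -> Counter:
    """Count nodes per `cat` value; absent or null values go under '<missing>'.

    Assumes a top-level "nodes" list (an assumption about the linked trace).
    """
    counts: Counter = Counter()
    for node in trace.get("nodes", []):
        cat = node.get("cat")
        counts[cat if cat is not None else "<missing>"] += 1
    return counts


def load_and_count(path: str) -> Counter:
    """Load a linked trace from `path` and summarize its `cat` values."""
    with open(path, "r", encoding="utf-8") as f:
        return count_cat_values(json.load(f))
```

If every node lands under "<missing>", the cat attribute was not carried over for that trace; any other keys indicate nodes that is_gpu_op() would treat as GPU operators.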

spandoescode avatar Apr 14 '25 22:04 spandoescode