Communication Nodes Incorrectly Marked as COMP_NODE
I followed the user guide and converted a PyTorch ET + Kineto trace into a Chakra ET. However, communication nodes appear to be incorrectly marked as computation nodes. For example:
{
  "id": "20",
  "name": "nccl:broadcast",
  "type": "COMP_NODE",
  "ctrlDeps": [
    "19"
  ],
  "dataDeps": [
    "19"
  ],
  "inputs": {
    "values": "[[21, 22, 0, 5, 8, 'cuda:7']]",
    "shapes": "[[5]]",
    "types": "['Tensor(long int)']"
  },
  "outputs": {
    "values": "[]",
    "shapes": "[]",
    "types": "[]"
  },
  "attr": [
    {
      "name": "rf_id",
      "int64Val": "19"
    },
    {
      "name": "fw_parent",
      "int64Val": "0"
    },
    {
      "name": "seq_id",
      "int64Val": "-1"
    },
    {
      "name": "scope",
      "int64Val": "7"
    },
    {
      "name": "tid",
      "int64Val": "1"
    },
    {
      "name": "fw_tid",
      "int64Val": "0"
    },
    {
      "name": "op_schema",
      "stringVal": ""
    },
    {
      "name": "is_cpu_op",
      "boolVal": true
    },
    {
      "name": "stream",
      "int64Val": "0"
    }
  ]
}
At https://github.com/mlcommons/chakra/blob/main/src/converter/pytorch_converter.py#L341
if json_node.is_gpu_op():
    if "ncclDevKernel_SendRecv" in json_node.name:
        parent_node = json_node_map[json_node.parent]
        keyword = (
            json_node_map[parent_node.parent].name
            if parent_node.name == "record_param_comms"
            else parent_node.name
        )
        if "send" in keyword:
            return COMM_SEND_NODE
        if "recv" in keyword:
            return COMM_RECV_NODE
    if "ncclKernel" in json_node.name or "ncclDevKernel" in json_node.name:
        return COMM_COLL_NODE
return COMP_NODE
It seems that a node can only be classified as a communication node if it is first recognized as a GPU op. However, the logic of json_node.is_gpu_op() is really confusing.
At https://github.com/mlcommons/chakra/blob/main/src/converter/pytorch_node.py#L149
def is_gpu_op(self) -> bool:
    """
    Check if the node is a GPU operator.
    Returns
        bool: True if the node is a GPU operator, False otherwise.
    """
    return self.cat is not None
However, the "cat" attribute seems to be dropped when the PyTorch ET and the Kineto trace are linked, so no node is recognized as a GPU op and every node ends up marked as COMP_NODE.
I do not know which part is behaving unexpectedly, but the logic of json_node.is_gpu_op() seems odd to me.
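The fall-through described above can be reproduced in isolation. This is only a minimal sketch, not the Chakra converter itself: `FakeNode` and `classify` are hypothetical stand-ins that mirror the two quoted snippets (with the send/recv branch omitted), and the kernel name `ncclDevKernel_Broadcast` and the `cat` value `"kernel"` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# String stand-ins for the Chakra node-type constants (assumption for this sketch).
COMP_NODE = "COMP_NODE"
COMM_COLL_NODE = "COMM_COLL_NODE"


@dataclass
class FakeNode:
    """Hypothetical stand-in for the converter's node class."""
    name: str
    cat: Optional[str] = None

    def is_gpu_op(self) -> bool:
        # Same test as the quoted is_gpu_op(): a node counts as a GPU op
        # only if it carries a `cat` value.
        return self.cat is not None


def classify(node: FakeNode) -> str:
    """Simplified copy of the quoted type logic (send/recv branch omitted)."""
    if node.is_gpu_op():
        if "ncclKernel" in node.name or "ncclDevKernel" in node.name:
            return COMM_COLL_NODE
    return COMP_NODE


cpu_side = FakeNode(name="nccl:broadcast")                        # no cat, like the node in this issue
gpu_kernel = FakeNode(name="ncclDevKernel_Broadcast", cat="kernel")
kernel_without_cat = FakeNode(name="ncclDevKernel_Broadcast")     # cat lost during linking

for node in (cpu_side, gpu_kernel, kernel_without_cat):
    print(node.name, node.cat, "->", classify(node))
```

Under these assumptions, a kernel whose `cat` is missing falls through to COMP_NODE. Note also that the name `nccl:broadcast` contains neither `ncclKernel` nor `ncclDevKernel`, so in this sketch that node would be classified as COMP_NODE even if it did carry a `cat` value.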
Hey @lurw2000, does this issue still persist? I opened a recent linked trace (ET + PyTorch) and can confirm that the cat string is carried over.
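One way to check this on a specific trace is to count how many nodes carry a `cat` field. The sketch below assumes the linked trace is a JSON object with a top-level `nodes` list whose entries may have a `cat` key. That layout, the helper name `count_cat_nodes`, and the sample data are all assumptions for illustration, so adjust them to the actual file.

```python
import json
from collections import Counter


def count_cat_nodes(trace: dict) -> Counter:
    """Count nodes per `cat` value; nodes without one are tallied as '<missing>'."""
    counts = Counter()
    for node in trace.get("nodes", []):
        counts[node.get("cat") or "<missing>"] += 1
    return counts


# Illustrative in-memory sample. For a real file, load it instead, e.g.:
#     with open(path_to_linked_trace) as f:
#         trace = json.load(f)
sample = json.loads("""
{
  "nodes": [
    {"id": 19, "name": "record_param_comms"},
    {"id": 20, "name": "ncclDevKernel_Broadcast", "cat": "kernel"}
  ]
}
""")

print(count_cat_nodes(sample))
```

If every node lands in the `<missing>` bucket, the `cat` value was not carried over for that trace, which would match the behavior reported above.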