nccl:broadcast collectives are missing from the converted trace but present in the trace_link file
Describe the Bug
After running a ResNet50 or TinyLlama2 workload on 4 ranks, I see at least one nccl:broadcast collective in the Kineto trace. The same collective appears in the trace_link file, but it is no longer present in the converted trace. Is this expected behavior, or is it an issue on the Chakra converter side?
I looked through the converter implementation but found nothing indicating that broadcast collectives should be dropped. Is there something I missed?
Steps to Reproduce
Using the Chakra version from 6 Sept, after the merge of commit #140.
Expected Behavior
See the nccl:broadcast collective in the converted trace.
Screenshots
This is the trace_link file; the broadcast collective is present.
This is the converted trace, in JSON format. No broadcast collective can be found; the search result is at the bottom of the picture.
I tested the same steps with the latest version of Chakra, installed from the repository on 16 Oct, and the behavior is the same.
While looking into this issue, I observed that the nccl:broadcast operation is a CPU operation and therefore does not pass this check in pytorch_converter.py:
    def get_protobuf_node_type_from_json_node(
        self, json_node_map: Dict[int, PyTorchNode], json_node: PyTorchNode
    ) -> int:
        """
        Determine the Protobuf node type from a Chakra node.
        Args:
            json_node_map (Dict[int, PyTorchNode]): Dictionary of JSON nodes.
            json_node (PyTorchNode): The JSON node to determine the type of.
        Returns:
            int: The corresponding Chakra node type.
        """
        if json_node.is_gpu_op():
            if "ncclDevKernel_SendRecv" in json_node.name:
                parent_node = json_node_map[json_node.parent]
                keyword = (
                    json_node_map[parent_node.parent].name
                    if parent_node.name == "record_param_comms"
                    else parent_node.name
                )
                if "send" in keyword:
                    return COMM_SEND_NODE
                if "recv" in keyword:
                    return COMM_RECV_NODE
            if "ncclKernel" in json_node.name or "ncclDevKernel" in json_node.name:
                return COMM_COLL_NODE
        return COMP_NODE
Why is this collective not included in the converted trace? Are there any plans to include the broadcast operation in the trace in the future? Thanks!
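To make the fall-through concrete, here is a minimal, self-contained sketch of the branch structure quoted above. It is not the Chakra implementation: `FakeNode`, `classify`, the string constants, and the example kernel name are hypothetical stand-ins used only to show that a node reported as a CPU op never reaches the collective check, whatever its name is.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the Chakra node-type constants.
COMP_NODE = "COMP_NODE"
COMM_COLL_NODE = "COMM_COLL_NODE"


@dataclass
class FakeNode:
    """Hypothetical, stripped-down replacement for PyTorchNode."""

    name: str
    gpu: bool

    def is_gpu_op(self) -> bool:
        return self.gpu


def classify(node: FakeNode) -> str:
    """Mirror only the GPU-op gate and the collective name check."""
    if node.is_gpu_op():
        if "ncclKernel" in node.name or "ncclDevKernel" in node.name:
            return COMM_COLL_NODE
    # Anything that is not a GPU op ends up here, including CPU-side
    # nccl:* operations.
    return COMP_NODE


cpu_broadcast = FakeNode(name="nccl:broadcast", gpu=False)
# Illustrative kernel name; real kernel names may differ.
gpu_kernel = FakeNode(name="ncclDevKernel_Broadcast_RING_LL", gpu=True)

print(classify(cpu_broadcast))  # COMP_NODE
print(classify(gpu_kernel))     # COMM_COLL_NODE
```

Under these assumptions, the CPU-side nccl:broadcast node is classified as a compute node, so it would only show up as a collective if a matching GPU kernel node were also present in the trace.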
I'm seeing similar issues with nccl:gather, which seems to pass the if statement Alex showed above but then fails to match any of the "known" collective operation strings in get_collective_comm_type().
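The kind of mismatch described here can be sketched as follows. This is an illustration only: `KNOWN_COLLECTIVES` and `match_collective` are hypothetical, and the table contents are an assumption, not the actual mapping used by get_collective_comm_type(). The point is that a substring lookup against a fixed table returns nothing for any collective the table does not list.

```python
from typing import Dict, Optional

# Hypothetical table of known collectives; the real mapping in Chakra
# may contain different entries.
KNOWN_COLLECTIVES: Dict[str, str] = {
    "allreduce": "ALL_REDUCE",
    "alltoall": "ALL_TO_ALL",
    "allgather": "ALL_GATHER",
    "reducescatter": "REDUCE_SCATTER",
    "broadcast": "BROADCAST",
}


def match_collective(name: str) -> Optional[str]:
    """Return the collective type whose key occurs in the normalized name."""
    normalized = name.replace("_", "").replace("-", "").lower()
    for key, comm_type in KNOWN_COLLECTIVES.items():
        if key in normalized:
            return comm_type
    # No known key matched; the caller must decide how to handle this.
    return None


print(match_collective("nccl:all_gather"))  # ALL_GATHER
print(match_collective("nccl:gather"))      # None
```

With a table like this, "nccl:gather" matches nothing because "gather" on its own is not a key, even though "allgather" is. Adding an entry for it, or failing with an explicit error that names the unmatched operation, would make such cases easier to spot.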
https://github.com/mlcommons/chakra/pull/190
@fh-TurbaAI can you have a look at the draft PR related to this issue at https://github.com/mlcommons/chakra/pull/190 ?