kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-19168: MirrorMaker kafka_metrics_count metric is always incorrect

Open Yunyung opened this issue 8 months ago • 7 comments

Provide tags (connector name and task index) so that we get a kafka.connect.mirror:type=kafka-metrics-count metric per task.

  • Set the connector name and task index in the metrics for MirrorSourceConnector and MirrorCheckpointConnector, and add corresponding tests.
  • Add the MirrorCheckpointMetricsTest.java unit test file.

Jira: KAFKA-19168

Yunyung avatar Apr 26 '25 11:04 Yunyung

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

github-actions[bot] avatar May 04 '25 03:05 github-actions[bot]

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

github-actions[bot] avatar May 07 '25 03:05 github-actions[bot]

Thanks for the PR!

This is effectively changing metrics so this requires a small KIP. You can find the process on https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

mimaison avatar May 16 '25 10:05 mimaison

Sure, I'll create a KIP. One thing I want to discuss is whether this situation is expected behavior.

Take this screenshot for example. Since task0 (B→A) creates metrics for one topic, the kafka.connect.mirror:type=kafka-metrics-count metric is set to 13. However, it might be set to 37 if task0 (A→B), which creates metrics for three topics, runs after (B→A) and overrides it. kafka_metrics_count metric can't distinguish between A→B and B→A for the task0 in this example. Screenshot from 2025-05-16 18-54-31

Yunyung avatar May 16 '25 11:05 Yunyung

In the Jira I suggested using the connector instance name and the task number. This should be unique in a Connect cluster.

mimaison avatar May 16 '25 11:05 mimaison

In the Jira I suggested using the connector instance name and the task number. This should be unique in a Connect cluster.

Hi @mimaison, what I meant above is: Cluster A -> B creates its own Metrics instance, and Cluster B -> A creates its own Metrics instance. The kafka-metrics-count metric collides between the two instances, even with the connector name and task index included. Is that the correct behavior? Please correct me if I'm missing anything. Thank you.

# connect-mirror-maker.properties
clusters = A, B

A->B.enabled = true
B->A.enabled = true

Yunyung avatar Jun 13 '25 14:06 Yunyung

Good point! Maybe in MirrorHerder we should create more unique connector names, for example using <SOURCE_ALIAS>-><TARGET_ALIAS>|<CONNECTOR_CLASS> instead of just the connector class name.

mimaison avatar Jun 18 '25 13:06 mimaison

That's a valid solution for me. It will need to prepend <SOURCE_ALIAS>-><TARGET_ALIAS>| to the connector name, whether it's the default connector name or a user-defined name using mirror connector name (I'm not sure if this would seem a bit strange to the user). Another approach is to add SOURCE_ALIAS and TARGET_ALIAS as default tags, so these two tags apply to all metrics, just like before, but this time including kafka_metrics_count.

Both approaches can solve the issue. Which one do you prefer?

Yunyung avatar Jun 20 '25 12:06 Yunyung

Another approach is to add SOURCE_ALIAS and TARGET_ALIAS as default tags, so these two tags apply to all metrics, just like before, but this time including kafka_metrics_count.

Good idea, that's would make metrics more consistent and easier to query in monitoring tools.

mimaison avatar Jun 25 '25 09:06 mimaison

Putting together a KIP based on our agreement, thanks for the discussion.

Yunyung avatar Jun 25 '25 10:06 Yunyung

Here is the KIP: https://cwiki.apache.org/confluence/x/vomKFQ. I also created the discussion thread a few days ago: https://lists.apache.org/thread/vdhfxlzkb8z7gpgky69y9qxw9cqjpog8. Thanks.

Yunyung avatar Aug 14 '25 03:08 Yunyung

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot] avatar Nov 12 '25 03:11 github-actions[bot]