
Old Coordinator Leader keeps emitting stale metrics for a specific datasource

Open aruraghuwanshi opened this issue 10 months ago • 8 comments

Affected Version

28.0.1

Description

After the coordinator leader election is concluded, sometimes the old coordinator leader keeps emitting stale metrics for a specific datasource, despite the new coordinator leader emitting the realtime metrics. This creates duplication and erroneous reporting.

Some details that were noted:

  • It's always the high-ingestion-volume datasource that enters this problem state, with two coordinators emitting that metric (one stale and one actual); all other datasources' metrics are emitted only by the new coordinator leader. (A detection sketch follows this comment.)
  • There is no difference in the logs emitted by the old coordinator leader for the problematic datasource versus the healthy ones.

Reference image attached (blue: the old coordinator's metric for one specific datasource in a stuck/stale state; green: the new coordinator emitting the same metric with realtime values).

Image

aruraghuwanshi avatar Feb 14 '25 22:02 aruraghuwanshi
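To make the duplicate-emitter state concrete, here is a minimal sketch (not from the original report) that flags any datasource whose lag metric is being emitted by more than one coordinator host. The datapoint shape, host names, and values are hypothetical; only the metric name comes from later in the thread.

```python
from collections import defaultdict

# Hypothetical datapoints pulled from the metrics backend over a recent
# window: (emitting host, datasource, metric name, value).
datapoints = [
    ("coordinator-old:8081", "high_volume_ds", "ingest/kafka/lag", 120000),
    ("coordinator-new:8081", "high_volume_ds", "ingest/kafka/lag", 350),
    ("coordinator-new:8081", "other_ds", "ingest/kafka/lag", 12),
]

# Group the emitting hosts per (datasource, metric). In a healthy cluster,
# each pair should be emitted by exactly one host: the current leader.
emitters = defaultdict(set)
for host, datasource, metric, _value in datapoints:
    emitters[(datasource, metric)].add(host)

for (datasource, metric), hosts in sorted(emitters.items()):
    if len(hosts) > 1:
        print(f"Duplicate emitters for {metric} on {datasource}: {sorted(hosts)}")
```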

Image

The metrics reporter was marked closed for that datasource during the leader transition, but the stale metrics are still being emitted by the old coordinator leader.

aruraghuwanshi avatar Feb 15 '25 00:02 aruraghuwanshi

@aruraghuwanshi , I am not sure if you are referring to a metric emitted by Druid itself or some metric emitted by Kafka (since the logs you shared above indicate something originating in Kafka code). If it's the former, can you please share the stale metric names that are coming from the old coordinator leader?

kfaraz avatar Feb 17 '25 02:02 kfaraz

Two examples are attached here. I'm referring to all of the Druid metrics mentioned in this Kafka section.

Image Image

aruraghuwanshi avatar Feb 18 '25 23:02 aruraghuwanshi

Adding the metric names here for reference, @kfaraz: ingest/kafka/lag, ingest/kafka/maxLag, ingest/kafka/avgLag, ingest/kafka/partitionLag. (A cross-check sketch follows this comment.)

aruraghuwanshi avatar Feb 18 '25 23:02 aruraghuwanshi
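As a cross-check against the emitted metrics, the same lag numbers can be read from the Kafka supervisor status API on the current Overlord leader. A minimal sketch, assuming the supervisor ID matches the datasource name and using placeholder host names; the exact lag field names in the status payload can vary by Druid version.

```python
import json
from urllib.request import urlopen

# Placeholder address and supervisor ID; in many deployments the
# supervisor ID matches the datasource name.
OVERLORD = "http://overlord:8090"
SUPERVISOR_ID = "high_volume_ds"

url = f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/status"
with urlopen(url) as resp:
    status = json.load(resp)

# Recent Druid versions include lag information in the status payload;
# the field names below are what we would expect but may vary by version.
payload = status.get("payload", {})
print("aggregateLag:", payload.get("aggregateLag"))
print("per-partition lag:", payload.get("lag"))
```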

For more context: we've faced this issue a few times before, and the only resolution seems to be killing the old coordinator leader pod. Once that pod restarts and stabilizes, the stale metric disappears and the real values of the Kafka lag metric(s) are emitted only by the current coordinator leader.

aruraghuwanshi avatar Feb 18 '25 23:02 aruraghuwanshi

Could you please check the service/heartbeat metric with the leader dimension for the two coordinators, and see whether both of them consider themselves to be the leader at the same time? (A direct leadership check is sketched after this comment.)

Side note: Are you running coordinator and overlord as a single service?

kfaraz avatar Mar 06 '25 04:03 kfaraz
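If the heartbeat metric is not available, leadership can also be checked directly: each Coordinator serves /druid/coordinator/v1/isLeader, which returns HTTP 200 when that process considers itself the leader and 404 otherwise. A minimal sketch with placeholder host names:

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Placeholder addresses for the two coordinator pods.
COORDINATORS = ["http://coordinator-a:8081", "http://coordinator-b:8081"]

for host in COORDINATORS:
    try:
        with urlopen(f"{host}/druid/coordinator/v1/isLeader") as resp:
            print(f"{host} believes it is the leader (HTTP {resp.status})")
    except HTTPError as err:
        # Non-leader coordinators answer this endpoint with HTTP 404.
        print(f"{host} is not the leader (HTTP {err.code})")
```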

Hey @kfaraz, unfortunately we're not emitting the heartbeat metric, but we did confirm that we had only one active leader at the time of this incident, as per the logs shown here.

> Are you running coordinator and overlord as a single service?

Yes, that is accurate.

Image

aruraghuwanshi avatar Mar 08 '25 01:03 aruraghuwanshi

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Dec 15 '25 00:12 github-actions[bot]