druid icon indicating copy to clipboard operation
druid copied to clipboard

Include datasource name in the segment publish errors

Open abhishekrb19 opened this issue 7 months ago • 3 comments

When a segment publish error occurs, it's hard to tell which supervisor to reset if there are multiple supervisors consuming from the same stream. This change includes the datasource name to the error messages. Ideally we would also include the supervisor ID in the error, but since the supervisor ID is currently the same as the datasource name, this should suffice for now.

This PR has:

  • [x] been self-reviewed.
  • [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

abhishekrb19 avatar Jun 03 '25 22:06 abhishekrb19

@abhishekrb19 , the SegmentPublishResult is anyway sent back to the task which already belongs to a specific datasource. The task logs this failure but I don't think it is ever logged on the Overlord, unless an exception occurs. I am not sure how including the datasource name in the publish result message would help.

kfaraz avatar Jun 04 '25 02:06 kfaraz

That’s right from a task logs perspective, @kfaraz. However, on 32.0.1, I noticed some error and warning logs on the Overlord that lacked the additional context, which prompted me to add the datasource name to the error messages to make them more actionable and help with troubleshooting. For example:

ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception while running task.
java.util.concurrent.ExecutionException: org.apache.druid.java.util.common.ISE: Failed to publish segments because of [org.apache.druid.error.DruidException: The new start metadata state[KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamStartSequenceNumbers{stream=...multiTopicPartition=false}=4278168656}}}]. Try resetting the supervisor.]

abhishekrb19 avatar Jun 04 '25 04:06 abhishekrb19

Fair enough, @abhishekrb19 ! I have run into this several times myself.

For this case, I would advise including the datasource name in the log on the supervisor side by using task.getDatasource() rather than putting it in the SegmentPublishResult.

In fact, I feel we should do this for several of the log messages coming out of the supervisor (maybe in a follow up).

kfaraz avatar Jun 04 '25 05:06 kfaraz

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Aug 04 '25 00:08 github-actions[bot]

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Sep 02 '25 00:09 github-actions[bot]