supervisor: Emit active/publishing task counts

Open adithyachakilam opened this issue 1 year ago • 2 comments

Description

Adding this metric would help see how much of time a supervisor is spending to publish tasks. It is important to keep this time low because auto scaling is skipped during this period, which could cause increased lag.

Release note

Adds new metrics: task/supervisor/active/count and task/supervisor/publishing/count.
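A minimal sketch of how the two counts might be derived, for illustration only. `TaskGroupPhase` and `SupervisorTaskCounts` are hypothetical names, not Druid classes; only the two metric names come from the release note above. How the supervisor actually tracks its task groups, and how the values reach the emitter, is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the phase of a task managed by a supervisor; not a Druid type.
enum TaskGroupPhase { READING, PUBLISHING }

final class SupervisorTaskCounts
{
  // Metric names taken from the release note.
  static final String ACTIVE_COUNT = "task/supervisor/active/count";
  static final String PUBLISHING_COUNT = "task/supervisor/publishing/count";

  // Counts tasks per phase and returns a map of metric name to value,
  // which a caller could then hand to whatever emitter it uses.
  static Map<String, Long> compute(List<TaskGroupPhase> taskPhases)
  {
    long active = taskPhases.stream().filter(p -> p == TaskGroupPhase.READING).count();
    long publishing = taskPhases.stream().filter(p -> p == TaskGroupPhase.PUBLISHING).count();

    Map<String, Long> metrics = new LinkedHashMap<>();
    metrics.put(ACTIVE_COUNT, active);
    metrics.put(PUBLISHING_COUNT, publishing);
    return metrics;
  }
}
```

Emitting both counts on every supervisor run would let a dashboard show how often the publishing count is non-zero, which is the period during which auto scaling is skipped.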


Key changed/added classes in this PR
  • SeekableStreamSupervisor.java

This PR has:

  • [x] been self-reviewed.
    • [ ] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
  • [x] added documentation for new or modified features or behaviors.
  • [x] a release note entry in the PR description.
  • [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [ ] added or updated version, license, or notice information in licenses.yaml
  • [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [ ] added integration tests.
  • [ ] been tested in a test Druid cluster.

adithyachakilam Oct 07 '24 18:10

There have to be some docs changes. How are you going to infer the time spent in publishing tasks (btw, what does a supervisor publishing a task mean exactly)? And how do you keep that time low, assuming you find that it is high?

abhishekagarwal87 Oct 08 '24 03:10

@adithyachakilam, leaving some suggestions here even though the PR is in draft right now.

> how much of time a supervisor is spending to publish tasks

Could you please elaborate? What time are you referring to exactly? The supervisor is just a thread which wakes up and launches or kills tasks and updates some metadata.

If you want to capture the time a task spends in publishing segments, then the correct metric for that would be something like ingest/publish/time (in the same vein as ingest/handoff/time and ingest/merge/time).
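As a rough illustration of the task-side alternative, the publish step could be wrapped in a timer and the elapsed time emitted once it completes. This is only a sketch: `PublishTimer`, the injected clock, and the `ObjLongConsumer` emitter are hypothetical stand-ins rather than Druid APIs, and `ingest/publish/time` is the name suggested above, not an existing metric as far as this thread states.

```java
import java.util.function.LongSupplier;
import java.util.function.ObjLongConsumer;

final class PublishTimer
{
  // Metric name as suggested in the comment above; treated here as an assumption.
  static final String PUBLISH_TIME = "ingest/publish/time";

  // Runs the publish action and emits the elapsed wall time in milliseconds.
  // The clock is injected (System::nanoTime in real use) so the logic can be tested.
  // If publishAction throws, nothing is emitted: only successful publishes are recorded.
  static void timePublish(Runnable publishAction, LongSupplier nanoClock, ObjLongConsumer<String> emitter)
  {
    long startNanos = nanoClock.getAsLong();
    publishAction.run();
    long elapsedMillis = (nanoClock.getAsLong() - startNanos) / 1_000_000L;
    emitter.accept(PUBLISH_TIME, elapsedMillis);
  }
}
```

A caller would pass `System::nanoTime` as the clock and a lambda that forwards the name and value to its metrics emitter.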

If you want to capture the number of tasks currently in the publishing phase, etc., then, as @suneet-s has suggested, emitting the current phase/state of a streaming task in its heartbeat makes sense. But it would need some changes from the current approach:

  • The status is not an intrinsic property of a task and must not be a part of the Task interface. You can inject the runner to build up the heartbeat map in the CliPeon.heartbeatDimensions() method.
  • For non-streaming tasks, instead of always emitting UNKNOWN, do not emit any value for this dimension.
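The two points above could be sketched roughly as follows. Everything here is hypothetical: `StreamingRunnerView`, `HeartbeatDimensions`, and the `status` dimension key are illustrative names, and the real `CliPeon.heartbeatDimensions()` signature and dimension keys may differ. The idea is only that the phase comes from an injected runner rather than from the task itself, and that the dimension is left out entirely when there is no streaming runner.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical read-only view of a streaming task runner; not a Druid interface.
interface StreamingRunnerView
{
  String currentPhase(); // for example "READING" or "PUBLISHING"
}

final class HeartbeatDimensions
{
  // Builds the heartbeat dimension map. The phase is read from the injected runner,
  // so nothing about status needs to live on the task abstraction itself.
  static Map<String, Object> build(String taskId, String dataSource, Optional<StreamingRunnerView> runner)
  {
    Map<String, Object> dims = new HashMap<>();
    dims.put("taskId", taskId);
    dims.put("dataSource", dataSource);
    // Non-streaming tasks have no runner here, so the "status" key is omitted
    // rather than being set to a placeholder such as UNKNOWN.
    runner.ifPresent(r -> dims.put("status", r.currentPhase()));
    return dims;
  }
}
```

Omitting the key keeps non-streaming heartbeats unchanged, so queries that group by the status dimension only see tasks that actually report a phase.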

kfaraz Oct 08 '24 05:10

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] Dec 08 '24 00:12

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] Jan 06 '25 00:01