Reporting top state handoff time even if it's beyond threshold value

Open rahulrane50 opened this issue 2 years ago • 2 comments

Describe the bug

In TopStateHandoffReportStage, when a top state is detected and Helix has a previous missing top state record (code pointer), Helix determines the startTime and endTime of the handoff and reports the handoff duration. However, the duration is reported only if it is within the configured threshold (code pointer). Ideally the duration should still be reported when the handoff exceeds the threshold. Whether such a handoff counts as successful or failed is debatable and we can discuss that, but either way it should be reported IMHO.

To Reproduce

Set the missing_top_state_threshold in the cluster config to some value. When a top state handoff from one host to another takes longer than that threshold, Helix does not update the handoff duration metrics; it only marks the handoff as "failed" and increments failedTopStateHandsOffCounter.

Expected behavior

Ideally the handoff duration should always be reported, even when it exceeds the threshold; the handoff can still be marked as failed.
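
To make the difference concrete, here is a minimal sketch contrasting the behavior described above with the proposed one. It is not the actual Helix code: the class `HandoffReportSketch`, its fields, and both method names are hypothetical stand-ins for the logic in TopStateHandoffReportStage and its metrics, and the threshold is passed in directly rather than read from the cluster config.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the handoff metrics; NOT Helix's real classes or fields.
class HandoffReportSketch {
    final AtomicLong failedCounter = new AtomicLong();
    final AtomicLong succeededCounter = new AtomicLong();
    // -1 means "no duration has been reported yet".
    final AtomicLong lastReportedDurationMs = new AtomicLong(-1);

    // Behavior as described in this issue: a handoff that exceeds the
    // threshold only bumps the failed counter; its duration is dropped.
    void reportCurrent(long startTime, long endTime, long thresholdMs) {
        long durationMs = endTime - startTime;
        if (durationMs > thresholdMs) {
            failedCounter.incrementAndGet();
            return;
        }
        succeededCounter.incrementAndGet();
        lastReportedDurationMs.set(durationMs);
    }

    // Proposed behavior: always record the duration, then classify the
    // handoff as failed or succeeded based on the threshold.
    void reportProposed(long startTime, long endTime, long thresholdMs) {
        long durationMs = endTime - startTime;
        lastReportedDurationMs.set(durationMs);
        if (durationMs > thresholdMs) {
            failedCounter.incrementAndGet();
        } else {
            succeededCounter.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        HandoffReportSketch current = new HandoffReportSketch();
        HandoffReportSketch proposed = new HandoffReportSketch();
        // A 5 s handoff against a 1 s threshold (illustrative values).
        current.reportCurrent(0L, 5000L, 1000L);
        proposed.reportProposed(0L, 5000L, 1000L);
        System.out.println("current duration:  " + current.lastReportedDurationMs.get());
        System.out.println("proposed duration: " + proposed.lastReportedDurationMs.get());
    }
}
```

In both variants the slow handoff is counted as failed; the only difference is whether its duration reaches the metrics. Whether it should be counted as failed at all is the open question mentioned above.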


rahulrane50, Feb 23 '23 18:02