Burrow icon indicating copy to clipboard operation
Burrow copied to clipboard

ISSUE 314: use notifier threshold to configure evaluator ShowAll

Open nickdevp opened this issue 7 years ago • 17 comments

Hello, This pull request is a proposed fix for https://github.com/linkedin/Burrow/issues/341. Prior to this change, when the 'notifier' coordinator submits an 'EvaluatorRequest', the 'ShowAll' parameter defaults to 'false'. The result is that not all StatusOK partitions are sent to the notification endpoints (i.e., 'http' and 'email'). The changes in this PR use the 'threshold' configuration property within the notifier section to set the 'ShowAll' parameter of the evaluator request. Thus, if the user has set the threshold=1, that notification endpoint will receive StatusOK partitions.

nickdevp avatar Feb 16 '18 16:02 nickdevp

Looks like there is a test which is checking for the ShowAll flag. This is the error I'm seeing in the travis-ci logs: assert.False(t, request.ShowAll, "Expected ShowAll to be false")

Perhaps I am missing some context here.. why does this test expect it to be false?

nickdevp avatar Feb 16 '18 17:02 nickdevp

The test checks that all parameters are valid. If you change the assumptions, the tests need to be changed as well.

However, this isn't going to work. You're setting the value of ShowAll at the coordinator level - that is, for all notifiers - based on the value of an individual notifier. If you have a notifier that has threshold set to 1, and another that has threshold set to 2, you will end up with undesired behavior for the second notifier.

toddpalino avatar Feb 22 '18 19:02 toddpalino

The 'ShowAll' flag at the coordinator level seems to only control the evaluation request. The threshold checks are still done at in the 'notifyModule' function if I am reading this correctly: https://github.com/linkedin/Burrow/blob/master/core/internal/notifier/coordinator.go#L556

nickdevp avatar Feb 26 '18 00:02 nickdevp

May have gotten the link wrong above. Please look at the following in core/internal/notifier/coordinator.go: // Only send a notification if the current status is above the module's threshold if int(status.Status) < viper.GetInt("notifier."+module.GetName()+".threshold") { return }

nickdevp avatar Feb 26 '18 00:02 nickdevp

Looks like the above failed due to package github.com/goreleaser/nfpm upgrading to go 1.10.

nickdevp avatar Feb 26 '18 15:02 nickdevp

The problem is not where the threshold check is done. It's that the notifier, by default, expects ShowAll to have been false so only the errored partitions are in the Partitions slice.

toddpalino avatar Feb 26 '18 21:02 toddpalino

I didn't understand your comment above. May you please clarify? If you feel that this is the wrong approach, let me know if you have an alternative in mind. When I run Burrow in my environment with the changes in this PR, I do see StatusOk partitions coming into my http endpoint.

nickdevp avatar Feb 27 '18 14:02 nickdevp

This is really good for making burrow reporting status for all the partitions when threshold is set to 1. In my scenario, I need to report the status of every partition to TSDB ( i.e influxdb) no matter whether its health is good or bad. So I think Burrow should provide some ways of sending this metrics out, rather than making some tricks with the notification.

alexandnpu avatar Mar 15 '18 10:03 alexandnpu

I would also like to have this feature. Our application teams want to get notifications on all their consumers on a consistent basis and want to see how offset are tracking for each consumer. Having the partition information disappear when the status goes back to OK is not great for them.

chasse-code avatar Apr 26 '18 13:04 chasse-code

Yes, @nickdevp, that's exactly the problem. You'll also get StatusOk partitions if you set the threshold higher with this patch in place. That's the undesirable behavior I'm talking about.

toddpalino avatar Apr 30 '18 19:04 toddpalino

@toddpalino I am not seeing the behavior you described. If the threshold is set to 1 for notifier N1, but higher for another notifier N2, then N2 will not see StatusOk partitions.

nickdevp avatar May 10 '18 17:05 nickdevp

Coverage Status

Coverage increased (+0.04%) to 74.066% when pulling d8f77cc9063b587c55f76aefebd32c551ce66181 on nickdevp:ISSU-341 into fecab1ea88779cdecedd382c1ed0f6c5a7ff5e0f on linkedin:master.

coveralls avatar May 15 '18 14:05 coveralls

May I get an update on this PR? I would like to know how we can move forward on this issue.

nickdevp avatar Jun 06 '18 14:06 nickdevp

@toddpalino to mitigate the risk of this PR, perhaps we can add another configuration option to enable this feature? If you have any suggestions, I'd be open to making changes

nickdevp avatar Jul 07 '18 00:07 nickdevp

@toddpalino I also need Notifier to report all partitions, including those with StatusOK. How can I achieve this?

Currently .Result.Partitions is empty for send-close event.

akamac avatar Feb 15 '19 14:02 akamac

The only way this could possibly be brought in is if you altered checkAndSendResponseToModules to perform major surgery on the evaluation response when ShowAll is true, in order to assure that the only partitions that are sent to the notifier are the ones that are in an errored state when the threshold is higher than OK.

This has to be accompanied by tests that show that the current behavior is honored in all cases except where the threshold is set to OK. This would need to include the case of multiple notifiers, where one is set to OK and one is set to a higher threshold.

toddpalino avatar Mar 02 '19 03:03 toddpalino

Any updates on this one?

richardwalsh avatar Mar 29 '21 15:03 richardwalsh