Devise scheme for coordinating alerts

Open hwine opened this issue 5 years ago • 3 comments

Currently, service reports are processed only by create_bugs.py, which handles only Bugzilla. For GitHub, I already have code for opening GitHub issues.

For simple cases, the run.sh can just be tweaked to only invoke the appropriate script. However, cases get non-simple pretty fast, and some routing becomes challenging.

How should the various alerting services interact? Ideally, we'd share as much code as possible.

Here's a current example from the GitHub world:

  1. The first step is to open a GitHub Issue, if possible.
  2. If Issues are not enabled (e.g. Treeherder), then a Bugzilla bug needs to be opened instead.
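
The two steps above amount to a fallback chain, which could be sketched roughly as follows. This is only an illustration: `Report`, `route`, the two stand-in channel functions, and the repository name are hypothetical and are not taken from create_bugs.py or the existing GitHub code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Report:
    """One finding from a service report (hypothetical shape)."""
    repo: str
    summary: str


# A channel takes a report and returns a reference to whatever it opened,
# or None when it cannot handle the report (e.g. Issues are disabled).
Channel = Callable[[Report], Optional[str]]


def route(report: Report, channels: List[Channel]) -> str:
    """Try each channel in order and return the first reference produced."""
    for channel in channels:
        ref = channel(report)
        if ref is not None:
            return ref
    raise RuntimeError(f"no channel accepted the report for {report.repo}")


# Stand-in channels: real ones would call the GitHub and Bugzilla APIs.
# The repository name below is an assumption made for the example.
REPOS_WITHOUT_ISSUES = {"mozilla/treeherder"}


def open_github_issue(report: Report) -> Optional[str]:
    if report.repo in REPOS_WITHOUT_ISSUES:
        return None  # Issues not enabled, let the next channel try
    return f"github:{report.repo}"


def open_bugzilla_bug(report: Report) -> Optional[str]:
    return f"bugzilla:{report.repo}"
```

Calling `route(report, [open_github_issue, open_bugzilla_bug])` tries GitHub first and only falls through to Bugzilla when the GitHub channel declines. Keeping the order in a plain list means a test suite can supply its own chain without touching the channels.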

Enhancements I can envision

  • If there's been no action for a certain amount of time, an escalation should be issued. Escalation is likely via a different medium, e.g. Bugzilla or email.
  • Some conditions might call for a direct page to on-call.
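
The time-based escalation idea might look something like the sketch below. The ladder of media, the idle thresholds, and the function name are all made up for illustration; nothing here reflects an agreed policy.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical escalation ladder: each medium and how long an alert may sit
# idle there before moving on. The thresholds are invented for this example.
LADDER = [
    ("github-issue", timedelta(days=7)),
    ("bugzilla", timedelta(days=14)),
    ("email", timedelta(days=21)),
]


def next_medium(current: str, last_activity: datetime, now: datetime) -> Optional[str]:
    """Return the medium to escalate to, or None if no escalation is due."""
    names = [name for name, _ in LADDER]
    idx = names.index(current)
    max_idle = LADDER[idx][1]
    if now - last_activity <= max_idle:
        return None  # still within the allowed idle window
    if idx + 1 >= len(LADDER):
        return None  # already at the top of the ladder
    return names[idx + 1]
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the rule easy to test. A direct page to on-call would be a separate decision and is deliberately left out of the ladder.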

hwine, Jul 05 '19 20:07

How should the various alerting services interact? Ideally, we'd share as much code as possible.

I think this makes sense. In the current model, could we combine the create_bugs.py script with your existing GitHub issues code into a single CLI tool/script? I see nothing wrong with combining the "GitHub world" features you mentioned with what create_bugs.py is currently doing.

If there's been no action for a certain amount of time, an escalation should be issued. Escalation is likely via a different medium, e.g. Bugzilla or email.

If this makes sense to automate for certain services (e.g. GitHub), then that sounds great to me. For AWS and GCP, this wouldn't really work, as the issues require manual triage and very tailored escalation on a per-issue basis.

Some conditions might call for a direct page to on-call.

I'd like to discuss what those cases would/could be, as I may be missing something, but at the moment my opinion is that direct pages shouldn't occur via pytest-services / nightly scans. Direct pages should only come from something that is near-real-time, like the fraud pipeline, or event-driven, like fx-sig-verify.

On another note, something that I would like to see (that could potentially relate to automated escalation) is metrics on issues discovered and remediated. For the AWS tests, there is a tracker bug that basically serves this purpose, but it would be awesome to be able to quickly see the number of issues discovered, marked as invalid, and marked as resolved. This probably exists to a certain extent already within the metrics pipeline, but with more and more services being added to pytest-services, it might make sense to standardize this to a certain extent so we can have more accurate metrics.

ajvb, Jul 05 '19 21:07

could we combine the create_bugs.py script with your existing GitHub issues code into a single CLI tool/script? ... If [automated escalation] makes sense to automate for certain services (e.g. GitHub), then that sounds great to me. For AWS and GCP, this wouldn't really work, as the issues require manual triage and very tailored escalation on a per-issue basis.

Makes sense -- given this, I'd suggest importable modules for Bugzilla, GitHub issues, email, Slack, etc. What remains in pytest-services is a check specific to the test suite that implements the fancy logic, if any.
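
As a rough sketch of what "importable modules" behind a shared interface could look like: the `Notifier` protocol, `RecordingNotifier`, and `alert_failures` below are hypothetical names, and the stand-in backend only records messages instead of talking to a real service.

```python
from typing import List, Protocol, Tuple


class Notifier(Protocol):
    """Interface each backend module (Bugzilla, GitHub, email, ...) would expose."""

    def send(self, subject: str, body: str) -> str:
        """Deliver one alert and return a reference to it."""
        ...


class RecordingNotifier:
    """Stand-in backend that records messages instead of calling a service."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sent: List[Tuple[str, str]] = []

    def send(self, subject: str, body: str) -> str:
        self.sent.append((subject, body))
        return f"{self.name}:{len(self.sent)}"


def alert_failures(failures: List[str], notifier: Notifier) -> List[str]:
    """Suite-specific glue: one alert per failed check, via any backend."""
    return [
        notifier.send(f"check failed: {name}", "see the service report")
        for name in failures
    ]
```

With this shape, the per-suite script only decides what to alert on and which backend to pass in, while each backend stays importable on its own.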

Some conditions might call for a direct page to on-call.

I'd like to discuss what those cases would/could be as I may be missing something, but at the current moment my opinion would be that direct pages shouldn't occur via pytest-services / nightly scans.

I agree. We'd cross that bridge if/when it makes sense.

On another note, [we should collect] metrics on issues discovered and remediated.

Great idea -- extracted to mozilla-services/foxsec#1365

hwine, Jul 08 '19 17:07

Makes sense -- given this, I'd suggest importable modules for Bugzilla, GitHub issues, email, Slack, etc. What remains in pytest-services is a check specific to the test suite that implements the fancy logic, if any.

Sounds good to me. In my opinion, we should keep the "alerting" wholly separate from the tests themselves, to make it easier for anyone to use these tests on their own as an auditing tool.

ajvb, Jul 08 '19 17:07