kibana icon indicating copy to clipboard operation
kibana copied to clipboard

[Synthetics] Improve synthetics alerting

Open shahzad31 opened this issue 1 year ago • 14 comments

Summary

Fixes https://github.com/elastic/kibana/issues/175298

Improve synthetics alerting !!

User will be able to create custom synthetics status alert by defining three kind of criteria

Monitor is down over last consective checks with threshold

image

From Locations threshold

Will be considered down only when from defined number of locations

image

Over time with checks threshold just like uptime custom status alert

image

shahzad31 avatar Jun 21 '24 04:06 shahzad31

/ci

shahzad31 avatar Jun 21 '24 04:06 shahzad31

:robot: GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

obltmachine avatar Jun 21 '24 04:06 obltmachine

/ci

shahzad31 avatar Jun 21 '24 06:06 shahzad31

/ci

shahzad31 avatar Jun 21 '24 12:06 shahzad31

/ci

shahzad31 avatar Jul 01 '24 07:07 shahzad31

/ci

shahzad31 avatar Jul 18 '24 16:07 shahzad31

/ci

shahzad31 avatar Jul 18 '24 17:07 shahzad31

/ci

shahzad31 avatar Jul 19 '24 11:07 shahzad31

:broken_heart: Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #28 / Synthetics API Tests GetMonitorsOverview accepts search queries
  • [job] [logs] FTR Configs #28 / Synthetics API Tests GetMonitorsOverview accepts search queries

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
synthetics 914 1079 +165

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/observability-synthetics-test-data - 4 +4

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
synthetics 854.8KB 1.1MB :warning: +247.7KB
triggersActionsUi 1.7MB 1.7MB +119.0B
total +247.8KB

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
@kbn/observability-synthetics-test-data - 1 +1

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
synthetics 19.4KB 21.7KB +2.3KB
Unknown metric groups

API count

id before after diff
@kbn/observability-synthetics-test-data - 4 +4

async chunk count

id before after diff
synthetics 8 9 +1

ESLint disabled line counts

id before after diff
synthetics 84 83 -1

Total ESLint disabled count

id before after diff
synthetics 91 90 -1

History

  • :broken_heart: Build #222448 failed 424bc3d82b8c20260c804d068a3ccd7c4aa27604
  • :broken_heart: Build #222419 failed 7c19f87c96ad1e29ea6ff47f91b814e2d911bfee

elasticmachine avatar Jul 19 '24 12:07 elasticmachine

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

elasticmachine avatar Jul 21 '24 08:07 elasticmachine

@elasticmachine merge upstream

shahzad31 avatar Aug 26 '24 16:08 shahzad31

/oblt-deploy

shahzad31 avatar Aug 27 '24 13:08 shahzad31

When selecting a monitor from the monitor filter, the monitor id appears in the selection box, rather than the name Screenshot 2024-08-27 at 10 07 07 AM

dominiqueclarke avatar Aug 27 '24 14:08 dominiqueclarke

/oblt-deploy

shahzad31 avatar Sep 04 '24 11:09 shahzad31

/oblt-deploy

shahzad31 avatar Sep 17 '24 14:09 shahzad31

Is this expected?

https://github.com/user-attachments/assets/d4156882-0117-4e8e-8d08-b72c0992683b

And does this info mean that users will receive one alert per monitor? I wonder if we can make it clearer to the user what exactly this means. cc @maciejforcone @paulb-elastic

maryam-saeidi avatar Sep 19 '24 13:09 maryam-saeidi

@maryam-saeidi, yes, the behaviour is that if you specify from at least 1 location (i.e. only 1) then you can select the toggle to get alerts from each location independently (this is how it works today as each monitor+location pair is considered independently).

When defining >1 you cannot select this option because all locations are now being considered and there needs to be a failure from more than one for the alert to be sent (so you can't opt to have them separately per location).

To your point though, this is not very clear and UX/wording could be improved.

I don't know if we do this elsewhere (@maciejforcone?), maybe when selecting >1, instead of the toggle disappearing, it is disabled (greyed) and a description added next to it?

I'm not convinced this wording is clear but along the lines of (you can't select to send alerts per location if the condition is checking FROM AT LEAST multiple locations) (@florent-leborgne perhaps you could suggest something better here?)

paulb-elastic avatar Sep 19 '24 14:09 paulb-elastic

@paulb-elastic happy to help but I'm not sure to understand correctly what this does. When the toggle is available, what is the difference when it's off vs. when it's on?

florent-leborgne avatar Sep 19 '24 14:09 florent-leborgne

Thanks, @paulb-elastic, for clarification.

Regarding the info message:

Monitor alerts are always grouped by monitor ID.

Can we maybe clarify it to: "Separate alerts will be sent per monitor" or something like that, if that is what we mean?

maryam-saeidi avatar Sep 19 '24 16:09 maryam-saeidi

@maryam-saeidi

Can we maybe clarify it to: "Separate alerts will be sent per monitor"

Yes - good shout (sorry, I hadn't picked up on that from your earlier screen shot) I think your description is much better

@florent-leborgne

happy to help but I'm not sure to understand correctly what this does. When the toggle is available, what is the difference when it's off vs. when it's on?

Yeah - very fair point - as I wrote that example description I wasn't happy with how to shrink this down to a single / short line, so not surprised it doesn't make sense.

I'll try to elaborate here and not be limited by short space.

Firstly, some background... monitors are run independently from each of the locations they are configured to run from. So, you can have My Monitor monitoring My Website from LocationA, LocationB and LocationC.

At the moment, the monitor+location is completely independent and will alert/recover independently. For example if My Website goes down completely then each independent test from LocationA, LocationB and LocationC will begin to fail and I will get 3 alerts for My Monitor from LocationA, LocationB and LocationC, separately.

Let’s imagine that My Website has not gone down, but there is some internet connectivity problem in the region of LocationA. In that scenario, the monitor running in LocationA will go DOWN, but those in LocationB and LocationC are UP. At the moment (current behaviour in Synthetics), you would get an alert from LocationA only.

This is a fair and reasonable scenario that many users will want from Synthetics. However, there are other users who only want to know about the problem if multiple locations report the problem. This could be because for example that they know there are sometimes transient issues from certain regions that they cannot do much to resolve (so such an alert is just noise).

In those cases, users may want the behaviour ”only alert me when at least 2 locations report that there is a problem”. In our example, this could be from any of LocationA, LocationB or LocationC, but importantly, only generate an alert if >1 of the locations report a problem.

This new rule allows users to be able to configure that.

Send alert per location From at least Behaviour
Off = 1 As soon as there is a failure from any location, send an alert. You will only get 1 alert for that monitor, regardless of how many locations report the failure.
On = 1 As soon as a location detects a failure, send an alert. You will receive separate alerts from any and all locations that identify the failure.
n/a > 1 (n) When at least n locations identify the failure, send an alert. You will only get 1 alert for that monitor, regardless of how many locations report the failure (as long as it’s from more than n locations).

(I appreciate that it’s not the location that actually sends the alert, but the rule that runs in Kibana and looks for those scenarios, but have written it like that as I think it’s clearer to understand the behaviour)

I am not sure how to boil that down into a simple description line of text next to the toggle 😬

paulb-elastic avatar Sep 20 '24 12:09 paulb-elastic

Thanks for the details @paulb-elastic!

I would suggest one of the following solutions (hopefully I understood it correctly now 😅):

  • Renaming the toggle to "Receive distinct alerts for each location". We could even append a tooltip next to it if we feel it's necessary. To say something like "When the monitor detects a failure on one or more locations, you receive an alert for each of these locations, instead of a single alert."
  • Use 2 radio buttons instead of a toggle, that way we can explain the behaviour for each state (for example the one selected by default could be "Receive a unique alert" with a tooltip to keep the label short and still explain more "When the monitor detects a failure on one or more locations, you get only one alert.", and the second option would be "Receive distinct alerts for each location")

Would that solve the issue?

florent-leborgne avatar Sep 20 '24 14:09 florent-leborgne

Thanks @florent-leborgne they're great suggestions!!

I think the toggle (first suggestion) might be slightly better in that it fits in more with the other configuration controls we have in general (at least, from what I can tell), although there's not much in it if others think the radio buttons would be better.

I like your wording - it's a lot clearer.

I was thinking about how it "disappears" too, and where I suggested above we disable it (grey it out). I think if the user selects >1, we should also switch the toggle off (as well as it being disabled so the user cannot switch it on) as that also describes the behaviour (i.e. you WON'T receive distinct alerts for each location).

paulb-elastic avatar Sep 20 '24 17:09 paulb-elastic

Updated copies

image

Disabled state

image

shahzad31 avatar Sep 23 '24 12:09 shahzad31

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#7012

[❌] x-pack/test/alerting_api_integration/observability/config.ts: 25/50 tests passed.

see run history

kibanamachine avatar Sep 25 '24 12:09 kibanamachine

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#7017

[✅] x-pack/test/alerting_api_integration/observability/config.ts: 50/50 tests passed.

see run history

kibanamachine avatar Sep 26 '24 08:09 kibanamachine

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#7018

[✅] x-pack/test/alerting_api_integration/observability/config.ts: 50/50 tests passed.

see run history

kibanamachine avatar Sep 26 '24 08:09 kibanamachine

:green_heart: Build Succeeded

  • Buildkite Build
  • Commit: c9fa0da1c8f181f9378dafecd8fc9fa2c2e44dc3
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-186585-c9fa0da1c8f1

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
synthetics 1019 1154 +135

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/observability-synthetics-test-data - 4 +4

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
synthetics 964.5KB 1.2MB :warning: +237.5KB
triggersActionsUi 1.6MB 1.6MB +115.0B
total +237.6KB

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
@kbn/observability-synthetics-test-data - 1 +1

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
synthetics 36.9KB 37.1KB +221.0B
Unknown metric groups

API count

id before after diff
@kbn/observability-synthetics-test-data - 4 +4

ESLint disabled line counts

id before after diff
@kbn/synthetics-e2e 19 18 -1
@kbn/test-suites-xpack 721 722 +1
synthetics 49 52 +3
total +3

Total ESLint disabled count

id before after diff
@kbn/synthetics-e2e 20 19 -1
@kbn/test-suites-xpack 745 746 +1
synthetics 55 58 +3
total +3

History

  • :yellow_heart: Build #238291 was flaky b89059c5047ef0fd88682cec5b969498ba89415b
  • :green_heart: Build #238129 succeeded 6fec0e2ded4339dd76d324f4f6816558b0618f2e
  • :broken_heart: Build #238058 failed 27c2816fd4029168ee1559fd24164fb4c941f10b
  • :broken_heart: Build #237898 failed 2b4f9715cba693eeae11658dee07baf5ffc63628
  • :broken_heart: Build #237731 failed cf339ed630d4ff79b7ff84a601fc656192e421ba

To update your PR or re-run it, just comment with: @elasticmachine merge upstream

kibana-ci avatar Oct 01 '24 16:10 kibana-ci

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11129829761

kibanamachine avatar Oct 01 '24 16:10 kibanamachine

💔 All backports failed

Status Branch Result
8.x Backport failed because of merge conflicts

You might need to backport the following PRs to 8.x:
- [Synthetics] Remove dead code (#193335)

Manual backport

To create the backport manually run:

node scripts/backport --pr 186585

Questions ?

Please refer to the Backport tool documentation

kibanamachine avatar Oct 01 '24 16:10 kibanamachine

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

shahzad31 avatar Oct 02 '24 09:10 shahzad31