[Metrics UI] Re-implement Anomaly Alerts with ML Alert Type
In PR #93813, we disabled anomaly alerting due to some misunderstandings about how to accurately query for the presence of anomalies. Since that point, ML has released its own anomaly alert type: https://github.com/elastic/kibana/pull/89286
We should re-implement our anomaly alert creation UI using this alert type instead of our own, so that all of the ML queries and calculations happen inside the ML app. We can work with the ML team to figure out which fields we can auto-supply (job ID, etc) and which we can/should expose to Metrics UI users.
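For context, a very rough sketch of what creating such a rule programmatically could look like, assuming the ML anomaly detection rule type can be targeted through Kibana's alerting HTTP API; the endpoint path, rule type id, consumer, and params shape below are assumptions for illustration, not the documented contract.

```ts
// Illustration only: endpoint, rule type id, consumer, and params schema are
// assumptions; check the alerting and ML docs for the real contract.
async function createAnomalyRule(kibanaUrl: string, apiKey: string, jobIds: string[]) {
  const response = await fetch(`${kibanaUrl}/api/alerting/rule`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'kbn-xsrf': 'true',
      Authorization: `ApiKey ${apiKey}`,
    },
    body: JSON.stringify({
      name: 'Metrics UI anomaly alert',
      rule_type_id: 'xpack.ml.anomaly_detection_alert', // assumed rule type id
      consumer: 'infrastructure', // assumed consumer
      schedule: { interval: '1m' },
      params: {
        // Fields the Metrics UI flyout would pre-fill on behalf of the user:
        jobSelection: { jobIds }, // assumed param shape
        severity: 75,
        resultType: 'record',
      },
      actions: [],
    }),
  });
  return response.json();
}
```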
AC:
- Users can set up alerts on their metrics UI-created anomaly detection jobs
- Under the hood, the ML alert type is used so that none of that logic leaks into our implementation
Notes/questions:
- Should we allow users to optionally create an alert on multiple jobs, i.e. alert on either Hosts OR Pods anomalies? The ML Alert type allows for multiple jobs to be specified.
Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)
@peteharverson @phillipb @simianhacker @sorantis let's chat about this here in this ticket re: feasibility, known issues, questions about how this will work, etc.
Pinging @elastic/machine-learning
hey @jasonrhodes!
I looked into the metrics.alert.anomaly rule implementation and have some questions:
License requirements
minimumLicenseRequired: 'basic' - this probably shouldn't be the case, because ML requires a platinum license.
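A minimal sketch of the registration change being suggested, assuming the rule type options object looks roughly like the alerting framework's registration shape; only the license field is the point here.

```ts
// Sketch only, not a verified diff: the relevant slice of the Metrics UI
// anomaly rule type registration with the license requirement raised.
const metricsAnomalyRuleTypeOptions = {
  id: 'metrics.alert.anomaly',
  name: 'Metrics anomaly',
  // ML anomaly detection requires at least a platinum (or trial) license,
  // so advertising the rule type at 'basic' lets users create rules that
  // can never run.
  minimumLicenseRequired: 'platinum',
};
```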
Rule params
- nodeType can be hosts or k8s. I tried to track down what the difference is, but it leads to two very similar functions, so it'd be better if you could point out what you're trying to achieve with it.
- metric, spaceId, and sourceId - I presume all of these params can be replaced with a list of anomaly detection job ids? In that case, it'd comply with the ML rule type.
- filterQuery (which you eventually pass as influencerFilter) - influencer filters are not supported (yet) by the ML anomaly detection rule type, but do you also need filtering by a partition field value or anything else?
Results and alert context
I've noticed you rely on record results, which the ML anomaly detection rule type already supports. 👍 The only things we need to add are the typical and actual values in the alert context, plus a generated summary message.
But have you considered using bucket or influencer results instead?
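To make the alert context point concrete, here is a rough sketch of the extra fields, written as a hypothetical TypeScript shape; the field names are illustrative, not the ML rule type's actual context schema.

```ts
// Illustrative only: hypothetical shape for the additional context fields
// discussed above. The real ML rule type defines its own context schema.
interface AnomalyAlertContext {
  jobId: string;
  anomalyScore: number;
  typicalValue: number; // what the model expected
  actualValue: number; // what was actually observed
  summary: string; // generated human-readable message
}

const exampleContext: AnomalyAlertContext = {
  jobId: 'hosts_memory_usage',
  anomalyScore: 82,
  typicalValue: 0.35,
  actualValue: 0.91,
  summary: 'Memory usage was 0.91 while the typical value was 0.35 (score 82).',
};
```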
Using ML rule type
- At the moment we use the job id as the alert (ex-instance) id, so if the alerting rule contains multiple jobs, the user can mute notifications for particular jobs. This can still be changed in the future. In your case the alert id is based on nodeType and metric, so we need to discuss how to handle that. Have you thought about including partition or influencer field values in the id so the user has more granular control over muting alerts?
- Are you going to create rules programmatically via the Kibana Alerting API?
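A minimal sketch of the more granular alert id idea, assuming the id is assembled from the rule params plus a per-anomaly partition/influencer value; the helper and its inputs are hypothetical.

```ts
// Hypothetical helper, not existing code: builds an alert (instance) id that
// includes an influencer/partition value so individual nodes can be muted,
// rather than keying only on nodeType + metric.
interface AnomalyIdParts {
  nodeType: 'hosts' | 'k8s';
  metric: 'memory_usage' | 'network_in' | 'network_out';
  influencerValue?: string; // e.g. the anomalous host.name or pod uid
}

function buildAlertInstanceId({ nodeType, metric, influencerValue }: AnomalyIdParts): string {
  const base = `${nodeType}-${metric}`;
  return influencerValue ? `${base}-${influencerValue}` : base;
}

// Example: muting 'hosts-memory_usage-host-42' would silence only that host.
```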
@darnautov thanks for the response! I'll let @elastic/metrics-ui folks who worked on this more closely (@simianhacker / @phillipb / @Zacqary) jump in and elaborate here, but I can answer a few of these questions.
nodeType can be hosts or k8s. I tried to track down what the difference is, but it leads to two very similar functions, so it'd be better if you could point out what you're trying to achieve with it.
We designed these jobs with the ML team so I think they've been through a ton of back and forth -- I think @phillipb and @blaklaybul can best elaborate on the thinking behind those jobs.
metric, spaceId, and sourceId - I presume all of these params can be replaced with a list of anomaly detection job ids? In that case, it'd comply with the ML rule type.
I think the issue we're going to run into here is that we will basically want to wrap the ML anomaly rule type with our own UI, so that we can pre-fill some things on behalf of the user. Otherwise, we're going to leak a lot of the ML abstractions out to them and force them to understand things about ML jobs that, if they've only set the jobs up via our UI, they won't know about (e.g. job IDs). But yes, to your point, we'll be able to craft the job ID from the metrics you mentioned and then pass those into the ML executor.
What's not clear is whether we can simply wrap the ML anomaly rule type with our own flyout UX, or whether those things are coupled. I also don't remember for sure, but I think if we use the ML rule type, it may have consequences for our users' ability to edit those alerts after they're created.
Other options we may need to consider are:
- ML exports their executor function and we import it and use it in our rule type
- ML exports lower level functions for querying anomalies and we use those in our executor
- Both?
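To make those options concrete, here is a rough sketch of the wrapping idea, assuming (hypothetically) that ML exports some anomaly-query helper the Metrics UI executor could delegate to; none of the function names or signatures below are confirmed APIs.

```ts
// Purely illustrative: a thin Metrics UI executor delegating to ML-provided
// logic. 'GetAnomalies' stands in for whatever ML might export, and the param
// names mirror the existing metrics.alert.anomaly rule params.
interface MetricsAnomalyParams {
  nodeType: 'hosts' | 'k8s';
  metric: 'memory_usage' | 'network_in' | 'network_out';
  threshold: number; // anomaly score threshold
}

type GetAnomalies = (args: { jobIds: string[]; threshold: number }) => Promise<
  Array<{ jobId: string; score: number; typical: number; actual: number }>
>;

async function metricsAnomalyExecutor(
  params: MetricsAnomalyParams,
  getAnomalies: GetAnomalies,
  scheduleAlert: (id: string, context: object) => void
) {
  // Derive the ML job name from the Metrics UI params so users never have to
  // know about job ids (see the job-naming discussion further down).
  const jobId = `${params.nodeType}_${params.metric}`;
  const anomalies = await getAnomalies({ jobIds: [jobId], threshold: params.threshold });
  for (const anomaly of anomalies) {
    scheduleAlert(`${params.nodeType}-${params.metric}`, {
      score: anomaly.score,
      typical: anomaly.typical,
      actual: anomaly.actual,
    });
  }
}
```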
Influencer filters are not supported (yet) by ML anomaly detection rule type, but do you also need filtering by a partition field value or anything else?
I believe we just need to be able to target an alert to a specific node or nodes: e.g. one specific host.name, or all host.names that match pattern-*. I'm not sure what filtering by partition field value would enable us to do, can you elaborate?
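For reference, a rough sketch of that kind of node targeting, written as an Elasticsearch query body against ML anomaly record results; the index pattern and field paths are assumptions about how the results are laid out, not a filter the ML rule type currently accepts.

```ts
// Sketch only: example node-targeting filter. The '.ml-anomalies-*' index and
// the 'influencers.*' field paths are assumptions about the results layout.
const influencerFilterQuery = {
  index: '.ml-anomalies-*',
  body: {
    query: {
      bool: {
        filter: [
          { term: { result_type: 'record' } },
          { term: { 'influencers.influencer_field_name': 'host.name' } },
          // Match one specific host, or a wildcard pattern such as 'pattern-*'.
          { wildcard: { 'influencers.influencer_field_values': 'pattern-*' } },
        ],
      },
    },
  },
};
```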
nodeType can be hosts or k8s. I tried to track down what the difference is, but it leads to two very similar functions, so it'd be better if you could point out what you're trying to achieve with it.
The metrics ML integration allows users to enable one of (or both) the metrics-ui ml modules, metrics_ui_hosts and metrics_ui_k8s.
The job names for the 6 jobs that fall under these modules follow the pattern <module_ref>_<job_name>, where module_ref is one of ['hosts', 'k8s'] and job_name is one of ['memory_usage', 'network_in', 'network_out']. It looks like all possible pairs of nodeType and metric in the alert implementation cover the 6 jobs in the Metrics UI.
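Based on that naming pattern, a small sketch of how the alert params could map to job names (a helper we'd have to write; not existing code):

```ts
// Hypothetical helper derived from the naming pattern described above:
// job names are `<module_ref>_<job_name>` with module_ref in ['hosts', 'k8s']
// and job_name in ['memory_usage', 'network_in', 'network_out'].
type NodeType = 'hosts' | 'k8s';
type Metric = 'memory_usage' | 'network_in' | 'network_out';

function metricsUiJobName(nodeType: NodeType, metric: Metric): string {
  return `${nodeType}_${metric}`;
}

// All six jobs covered by the two modules:
const allJobs = (['hosts', 'k8s'] as NodeType[]).flatMap((nodeType) =>
  (['memory_usage', 'network_in', 'network_out'] as Metric[]).map((metric) =>
    metricsUiJobName(nodeType, metric)
  )
);
// => ['hosts_memory_usage', 'hosts_network_in', 'hosts_network_out',
//     'k8s_memory_usage', 'k8s_network_in', 'k8s_network_out']
```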
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)
@elastic/obs-ux-management-team saw this come up as stale and moved it to your board.
Not prioritized at this time.