kibana [Detection Engine] Adds Alert Suppression to ML Rules

Summary

This PR introduces Alert Suppression for ML Detection Rules. This feature is behaviorally similar to alerting suppression for other Detection Engine Rule types, and nearly identical to the analogous features for EQL rules.

There are some additional UI behaviors introduced here as well, mainly intended to cover the shortcomings discovered in https://github.com/elastic/kibana/issues/183100. Those behaviors are:

Populating the suppression field list with fields from the anomaly index(es).
Disabling the suppression UI if no selected ML jobs are running (because we cannot populate the list of fields on which they'll be suppressing).
Warning the user if some selected ML jobs are not running (because the list of suppression fields may be incomplete).

See screenshots below for more info.

Intermediate Serverless Deployment

As per the "intermediate deployment" requirements for serverless, while the schema (and declared alert SO mappings) will be extended to allow this functionality, the user-facing features are currently hidden behind a feature flag. Once this is merged and released, we can issue a "final" deployment in which the feature flag is enabled, and the feature effectively released.

Screenshots

Overview of new UI fields
Example of Anomaly fields in suppression combobox
Suppression disabled due to no jobs running
Warning due to not all jobs running

Steps to Review

Review the Test Plan for an overview of behavior
Review Integration tests for an overview of implementation and edge cases
Review Cypress tests for an overview of UX changes
Testing on Demo Instance (elastic/changeme)
1. This instance has the relevant feature flag enabled, has some sample auditbeat data, as well as the anomalies archive data for the purposes of exercising an ML rule against "real" anomalies
2. There are a few example rules in the default space:
  1. A simple query rule against auditbeat data
  2. An ML rule with per-execution suppression on both by_field_name and by_field_value (which ends up not actually suppressing anything)
  3. An ML rule with per-execution suppression on by_field_name (which suppresses all anomalies into a single alert)

Related Issues

This feature was temporarily blocked by https://github.com/elastic/kibana/issues/183100, but those changes are now in this PR.

Checklist

[x] Functional changes are hidden behind a feature flag. If not hidden, the PR explains why these changes are being implemented in a long-living feature branch.
[x] Functional changes are covered with a test plan and automated tests.
- Test Plan
[ ] Stability of new and changed tests is verified using the Flaky Test Runner in both ESS and Serverless. By default, use 200 runs for ESS and 200 runs for Serverless.
- ESS - Cypress x 200
- Serverless - Cypress x 200
- ESS - API x 200
- Serverless - API x 200
[ ] Comprehensive manual testing is done by two engineers: the PR author and one of the PR reviewers. Changes are tested in both ESS and Serverless.
[ ] Mapping changes are accompanied by a technical design document. It can be a GitHub issue or an RFC explaining the changes. The design document is shared with and approved by the appropriate teams and individual stakeholders.
[ ] (OPTIONAL) OpenAPI specs changes include detailed descriptions and examples of usage and are ready to be released on https://docs.elastic.co/api-reference. NOTE: This is optional because at the moment we don't have yet any OpenAPI specs that would be fully "documented" and "GA-ready" for publishing on https://docs.elastic.co/api-reference.
[ ] Functional changes are communicated to the Docs team. A ticket is opened in https://github.com/elastic/security-docs using the Internal documentation request (Elastic employees) template. The following information is included: feature flags used, target ESS version, planned timing for ESS and Serverless releases.

Apr 26 '24 22:04 rylnd

/ci

May 08 '24 18:05 rylnd

/ci

May 08 '24 22:05 rylnd

/ci

May 09 '24 21:05 rylnd

/ci

May 10 '24 20:05 rylnd

/ci

May 16 '24 20:05 rylnd

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6019

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 175/200 tests passed.

see run history

May 16 '24 21:05 kibanamachine

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6020

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/serverless.config.ts: 143/200 tests passed.

see run history

May 16 '24 21:05 kibanamachine

Pinging @elastic/security-detection-engine (Team:Detection Engine)

May 16 '24 21:05 elasticmachine

The flaky runner caught two issues in my API tests:

A nondeterministic ordering, paired with an assertion that assumed a specific ordering (fixed in fc40364)
Archive-related failures that I've been pursuing on https://github.com/elastic/kibana/pull/182183. I believe I've figured out the issue, here, but the flaky test runner is being ... flaky.

TL;DR there's some outstanding flake in the API tests, but I'm on it.

May 16 '24 22:05 rylnd

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6017

[❌] Security Solution Detection Engine - Cypress: 126/200 tests passed.

see run history

May 17 '24 07:05 kibanamachine

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6018

[❌] [Serverless] Security Solution Detection Engine - Cypress: 132/200 tests passed.

see run history

May 17 '24 09:05 kibanamachine

Hey @rylnd, the branch seems to be behind main and conflicts with it, and the CI is red. I'm guessing the PR is still a work in progress, so to avoid receiving notifications about it in team channels, I'll convert it to a draft. Let us know when you need a review from the RM team.

May 28 '24 12:05 banderror

/ci

May 30 '24 22:05 rylnd

/ci

May 31 '24 01:05 rylnd

/ci

Jun 04 '24 16:06 rylnd

/ci

Jun 04 '24 19:06 rylnd

@banderror et al: this is ready for review, again!

Jun 06 '24 00:06 rylnd

I think it would be a good idea to get https://github.com/rylnd/kibana/pull/9 merged into this one. Otherwise, it's difficult to test PR and understand which fields available for suppression.

I have a custom ML job that uses another custom index. But I can see only ECS fields from alert index in UI.

Jun 11 '24 14:06 vitaliidm

/ci

Jun 18 '24 04:06 rylnd

/ci

Jun 18 '24 20:06 rylnd

/ci

Jun 18 '24 23:06 rylnd

/ci

Jun 19 '24 00:06 rylnd

/ci

Jun 21 '24 02:06 rylnd

I'm now reopening this for review, after the brief diversion of rylnd/kibana#9. The PR description has been updated to reflect those changes, and review steps now include a new demo environment.

There are currently 2-3 failing cypress tests relating to ML rule creation/edit. While they are sporadic and seem related to the test environment's ML jobs, they may also indicate a UI bug. I'm exploring all possibilities, but reviewers can either focus on the backend implementation or point an eye toward reproducing those failures. Thanks!

Jun 21 '24 04:06 rylnd

Moving this back into draft for now; apologies for the noise. Please see this slack thread for more details on status.

Jun 21 '24 22:06 rylnd

/ci

Jun 24 '24 20:06 rylnd

Regarding merging to main and release to serverless, it's correct that we should wait to merge and release this https://github.com/elastic/project-controller/pull/945 ?

@nkhristinin my understanding is that merging/releasing this PR would be the "intermediate" release, where the schema is changed and the feature is technically supported, but is hidden behind a feature flag. We'd need those role changes merged before the subsequent "final" release, where we enable the feature via feature flag.

So: this PR and the roles one can be merged independently, so long as they both happen before that hypothetical "final" release 👍

Jun 27 '24 20:06 rylnd

@rylnd, can you also have a look at a bunch of comments from a previous review? Thanks

Jun 28 '24 10:06 vitaliidm

:yellow_heart: Build succeeded, but was flaky

Buildkite Build
Commit: a9b410d10115c9954450f8ba2752ee5c77d05b39

Failed CI Steps

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id	before	after	diff
`securitySolution`	5582	5585	+3

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`securitySolution`	15.6MB	15.6MB	+3.3KB

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id	before	after	diff
`securitySolution`	83.7KB	83.8KB	+68.0B

Unknown metric groups

References to deprecated APIs

id	before	after	diff
`securitySolution`	571	574	+3

History

:broken_heart: Build #218877 failed 19245967d25acbb5e39aecf5bbe2b9ef9bee5598
:broken_heart: Build #218588 failed d666eb03328d5a4ac01c9f97bcf6bac5dcca3874
:broken_heart: Build #218583 failed d152289331954a43ef76a10acf5ea053b082aa39
:yellow_heart: Build #218513 was flaky 6ce66d14c107a562c7c177a3e143f80f74730f11
:yellow_heart: Build #218338 was flaky 307e9644b439bdc71a0260bdd222c09939eb9dcf

To update your PR or re-run it, just comment with: @elasticmachine merge upstream

cc @rylnd

Jul 01 '24 21:07 kibana-ci

@rylnd

I also think that we should disable duration and missing fields checkboxes, when suppression fields controller is disabled. Otherwise user can still edit it and save rule with new configuration

https://github.com/elastic/kibana/assets/92328789/1e58d9d4-e2f0-451c-88c6-ced6f349770e

Jul 02 '24 15:07 vitaliidm

kibana kibana copied to clipboard

[Detection Engine] Adds Alert Suppression to ML Rules

Summary

Intermediate Serverless Deployment

Screenshots

Steps to Review

Related Issues

Checklist

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6019

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6020

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6017

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6018

:yellow_heart: Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

Module Count

Async chunks

Page load bundle

References to deprecated APIs

History

kibana
kibana copied to clipboard