kibana
kibana copied to clipboard
[Detection Engine] Adds Alert Suppression to ML Rules
Summary
This PR introduces Alert Suppression for ML Detection Rules. This feature is behaviorally similar to alerting suppression for other Detection Engine Rule types, and nearly identical to the analogous features for EQL rules.
There are some additional UI behaviors introduced here as well, mainly intended to cover the shortcomings discovered in https://github.com/elastic/kibana/issues/183100. Those behaviors are:
- Populating the suppression field list with fields from the anomaly index(es).
- Disabling the suppression UI if no selected ML jobs are running (because we cannot populate the list of fields on which they'll be suppressing).
- Warning the user if some selected ML jobs are not running (because the list of suppression fields may be incomplete).
See screenshots below for more info.
Intermediate Serverless Deployment
As per the "intermediate deployment" requirements for serverless, while the schema (and declared alert SO mappings) will be extended to allow this functionality, the user-facing features are currently hidden behind a feature flag. Once this is merged and released, we can issue a "final" deployment in which the feature flag is enabled, and the feature effectively released.
Screenshots
- Overview of new UI fields
- Example of Anomaly fields in suppression combobox
- Suppression disabled due to no jobs running
- Warning due to not all jobs running
Steps to Review
- Review the Test Plan for an overview of behavior
- Review Integration tests for an overview of implementation and edge cases
- Review Cypress tests for an overview of UX changes
- Testing on Demo Instance (elastic/changeme)
- This instance has the relevant feature flag enabled, has some sample auditbeat data, as well as the anomalies archive data for the purposes of exercising an ML rule against "real" anomalies
- There are a few example rules in the default space:
- A simple query rule against auditbeat data
- An ML rule with per-execution suppression on both
by_field_name
andby_field_value
(which ends up not actually suppressing anything) - An ML rule with per-execution suppression on
by_field_name
(which suppresses all anomalies into a single alert)
Related Issues
- This feature was temporarily blocked by https://github.com/elastic/kibana/issues/183100, but those changes are now in this PR.
Checklist
- [x] Functional changes are hidden behind a feature flag. If not hidden, the PR explains why these changes are being implemented in a long-living feature branch.
- [x] Functional changes are covered with a test plan and automated tests.
- [ ] Stability of new and changed tests is verified using the Flaky Test Runner in both ESS and Serverless. By default, use 200 runs for ESS and 200 runs for Serverless.
- ESS - Cypress x 200
- Serverless - Cypress x 200
- ESS - API x 200
- Serverless - API x 200
- [ ] Comprehensive manual testing is done by two engineers: the PR author and one of the PR reviewers. Changes are tested in both ESS and Serverless.
- [ ] Mapping changes are accompanied by a technical design document. It can be a GitHub issue or an RFC explaining the changes. The design document is shared with and approved by the appropriate teams and individual stakeholders.
- [ ] (OPTIONAL) OpenAPI specs changes include detailed descriptions and examples of usage and are ready to be released on https://docs.elastic.co/api-reference. NOTE: This is optional because at the moment we don't have yet any OpenAPI specs that would be fully "documented" and "GA-ready" for publishing on https://docs.elastic.co/api-reference.
- [ ] Functional changes are communicated to the Docs team. A ticket is opened in https://github.com/elastic/security-docs using the Internal documentation request (Elastic employees) template. The following information is included: feature flags used, target ESS version, planned timing for ESS and Serverless releases.
/ci
/ci
/ci
/ci
/ci
Flaky Test Runner Stats
🟠 Some tests failed. - kibana-flaky-test-suite-runner#6019
[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 175/200 tests passed.
Flaky Test Runner Stats
🟠 Some tests failed. - kibana-flaky-test-suite-runner#6020
[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/serverless.config.ts: 143/200 tests passed.
Pinging @elastic/security-detection-engine (Team:Detection Engine)
The flaky runner caught two issues in my API tests:
- A nondeterministic ordering, paired with an assertion that assumed a specific ordering (fixed in fc40364)
- Archive-related failures that I've been pursuing on https://github.com/elastic/kibana/pull/182183. I believe I've figured out the issue, here, but the flaky test runner is being ... flaky.
TL;DR there's some outstanding flake in the API tests, but I'm on it.
Flaky Test Runner Stats
🟠 Some tests failed. - kibana-flaky-test-suite-runner#6017
[❌] Security Solution Detection Engine - Cypress: 126/200 tests passed.
Flaky Test Runner Stats
🟠 Some tests failed. - kibana-flaky-test-suite-runner#6018
[❌] [Serverless] Security Solution Detection Engine - Cypress: 132/200 tests passed.
Hey @rylnd, the branch seems to be behind main and conflicts with it, and the CI is red. I'm guessing the PR is still a work in progress, so to avoid receiving notifications about it in team channels, I'll convert it to a draft. Let us know when you need a review from the RM team.
/ci
/ci
/ci
/ci
@banderror et al: this is ready for review, again!
I think it would be a good idea to get https://github.com/rylnd/kibana/pull/9 merged into this one. Otherwise, it's difficult to test PR and understand which fields available for suppression.
I have a custom ML job that uses another custom index. But I can see only ECS fields from alert index in UI.
/ci
/ci
/ci
/ci
/ci
I'm now reopening this for review, after the brief diversion of rylnd/kibana#9. The PR description has been updated to reflect those changes, and review steps now include a new demo environment.
There are currently 2-3 failing cypress tests relating to ML rule creation/edit. While they are sporadic and seem related to the test environment's ML jobs, they may also indicate a UI bug. I'm exploring all possibilities, but reviewers can either focus on the backend implementation or point an eye toward reproducing those failures. Thanks!
Moving this back into draft for now; apologies for the noise. Please see this slack thread for more details on status.
/ci
Regarding merging to main and release to serverless, it's correct that we should wait to merge and release this https://github.com/elastic/project-controller/pull/945 ?
@nkhristinin my understanding is that merging/releasing this PR would be the "intermediate" release, where the schema is changed and the feature is technically supported, but is hidden behind a feature flag. We'd need those role changes merged before the subsequent "final" release, where we enable the feature via feature flag.
So: this PR and the roles one can be merged independently, so long as they both happen before that hypothetical "final" release 👍
@rylnd, can you also have a look at a bunch of comments from a previous review? Thanks
:yellow_heart: Build succeeded, but was flaky
- Buildkite Build
- Commit: a9b410d10115c9954450f8ba2752ee5c77d05b39
Failed CI Steps
Metrics [docs]
Module Count
Fewer modules leads to a faster build time
id | before | after | diff |
---|---|---|---|
securitySolution |
5582 | 5585 | +3 |
Async chunks
Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app
id | before | after | diff |
---|---|---|---|
securitySolution |
15.6MB | 15.6MB | +3.3KB |
Page load bundle
Size of the bundles that are downloaded on every page load. Target size is below 100kb
id | before | after | diff |
---|---|---|---|
securitySolution |
83.7KB | 83.8KB | +68.0B |
Unknown metric groups
References to deprecated APIs
id | before | after | diff |
---|---|---|---|
securitySolution |
571 | 574 | +3 |
History
- :broken_heart: Build #218877 failed 19245967d25acbb5e39aecf5bbe2b9ef9bee5598
- :broken_heart: Build #218588 failed d666eb03328d5a4ac01c9f97bcf6bac5dcca3874
- :broken_heart: Build #218583 failed d152289331954a43ef76a10acf5ea053b082aa39
- :yellow_heart: Build #218513 was flaky 6ce66d14c107a562c7c177a3e143f80f74730f11
- :yellow_heart: Build #218338 was flaky 307e9644b439bdc71a0260bdd222c09939eb9dcf
To update your PR or re-run it, just comment with:
@elasticmachine merge upstream
cc @rylnd
@rylnd
I also think that we should disable duration and missing fields checkboxes, when suppression fields controller is disabled. Otherwise user can still edit it and save rule with new configuration
https://github.com/elastic/kibana/assets/92328789/1e58d9d4-e2f0-451c-88c6-ced6f349770e