kibana icon indicating copy to clipboard operation
kibana copied to clipboard

[Detection Engine] Adds Alert Suppression to ML Rules

Open rylnd opened this issue 10 months ago • 29 comments

Summary

This PR introduces Alert Suppression for ML Detection Rules. This feature is behaviorally similar to alerting suppression for other Detection Engine Rule types, and nearly identical to the analogous features for EQL rules.

There are some additional UI behaviors introduced here as well, mainly intended to cover the shortcomings discovered in https://github.com/elastic/kibana/issues/183100. Those behaviors are:

  1. Populating the suppression field list with fields from the anomaly index(es).
  2. Disabling the suppression UI if no selected ML jobs are running (because we cannot populate the list of fields on which they'll be suppressing).
  3. Warning the user if some selected ML jobs are not running (because the list of suppression fields may be incomplete).

See screenshots below for more info.

Intermediate Serverless Deployment

As per the "intermediate deployment" requirements for serverless, while the schema (and declared alert SO mappings) will be extended to allow this functionality, the user-facing features are currently hidden behind a feature flag. Once this is merged and released, we can issue a "final" deployment in which the feature flag is enabled, and the feature effectively released.

Screenshots

  • Overview of new UI fields Screenshot 2024-05-16 at 3 22 02 PM
  • Example of Anomaly fields in suppression combobox Screenshot 2024-06-06 at 5 14 17 PM
  • Suppression disabled due to no jobs running Screenshot 2024-06-17 at 11 23 39 PM
  • Warning due to not all jobs running Screenshot 2024-06-17 at 11 26 16 PM

Steps to Review

  1. Review the Test Plan for an overview of behavior
  2. Review Integration tests for an overview of implementation and edge cases
  3. Review Cypress tests for an overview of UX changes
  4. Testing on Demo Instance (elastic/changeme)
    1. This instance has the relevant feature flag enabled, has some sample auditbeat data, as well as the anomalies archive data for the purposes of exercising an ML rule against "real" anomalies
    2. There are a few example rules in the default space:
      1. A simple query rule against auditbeat data
      2. An ML rule with per-execution suppression on both by_field_name and by_field_value (which ends up not actually suppressing anything)
      3. An ML rule with per-execution suppression on by_field_name (which suppresses all anomalies into a single alert)

Related Issues

  • This feature was temporarily blocked by https://github.com/elastic/kibana/issues/183100, but those changes are now in this PR.

Checklist

  • [x] Functional changes are hidden behind a feature flag. If not hidden, the PR explains why these changes are being implemented in a long-living feature branch.
  • [x] Functional changes are covered with a test plan and automated tests.
  • [ ] Stability of new and changed tests is verified using the Flaky Test Runner in both ESS and Serverless. By default, use 200 runs for ESS and 200 runs for Serverless.
    • ESS - Cypress x 200
    • Serverless - Cypress x 200
    • ESS - API x 200
    • Serverless - API x 200
  • [ ] Comprehensive manual testing is done by two engineers: the PR author and one of the PR reviewers. Changes are tested in both ESS and Serverless.
  • [ ] Mapping changes are accompanied by a technical design document. It can be a GitHub issue or an RFC explaining the changes. The design document is shared with and approved by the appropriate teams and individual stakeholders.
  • [ ] (OPTIONAL) OpenAPI specs changes include detailed descriptions and examples of usage and are ready to be released on https://docs.elastic.co/api-reference. NOTE: This is optional because at the moment we don't have yet any OpenAPI specs that would be fully "documented" and "GA-ready" for publishing on https://docs.elastic.co/api-reference.
  • [ ] Functional changes are communicated to the Docs team. A ticket is opened in https://github.com/elastic/security-docs using the Internal documentation request (Elastic employees) template. The following information is included: feature flags used, target ESS version, planned timing for ESS and Serverless releases.

rylnd avatar Apr 26 '24 22:04 rylnd

/ci

rylnd avatar May 08 '24 18:05 rylnd

/ci

rylnd avatar May 08 '24 22:05 rylnd

/ci

rylnd avatar May 09 '24 21:05 rylnd

/ci

rylnd avatar May 10 '24 20:05 rylnd

/ci

rylnd avatar May 16 '24 20:05 rylnd

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6019

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 175/200 tests passed.

see run history

kibanamachine avatar May 16 '24 21:05 kibanamachine

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6020

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/serverless.config.ts: 143/200 tests passed.

see run history

kibanamachine avatar May 16 '24 21:05 kibanamachine

Pinging @elastic/security-detection-engine (Team:Detection Engine)

elasticmachine avatar May 16 '24 21:05 elasticmachine

The flaky runner caught two issues in my API tests:

  1. A nondeterministic ordering, paired with an assertion that assumed a specific ordering (fixed in fc40364)
  2. Archive-related failures that I've been pursuing on https://github.com/elastic/kibana/pull/182183. I believe I've figured out the issue, here, but the flaky test runner is being ... flaky.

TL;DR there's some outstanding flake in the API tests, but I'm on it.

rylnd avatar May 16 '24 22:05 rylnd

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6017

[❌] Security Solution Detection Engine - Cypress: 126/200 tests passed.

see run history

kibanamachine avatar May 17 '24 07:05 kibanamachine

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6018

[❌] [Serverless] Security Solution Detection Engine - Cypress: 132/200 tests passed.

see run history

kibanamachine avatar May 17 '24 09:05 kibanamachine

Hey @rylnd, the branch seems to be behind main and conflicts with it, and the CI is red. I'm guessing the PR is still a work in progress, so to avoid receiving notifications about it in team channels, I'll convert it to a draft. Let us know when you need a review from the RM team.

banderror avatar May 28 '24 12:05 banderror

/ci

rylnd avatar May 30 '24 22:05 rylnd

/ci

rylnd avatar May 31 '24 01:05 rylnd

/ci

rylnd avatar Jun 04 '24 16:06 rylnd

/ci

rylnd avatar Jun 04 '24 19:06 rylnd

@banderror et al: this is ready for review, again!

rylnd avatar Jun 06 '24 00:06 rylnd

I think it would be a good idea to get https://github.com/rylnd/kibana/pull/9 merged into this one. Otherwise, it's difficult to test PR and understand which fields available for suppression.

I have a custom ML job that uses another custom index. But I can see only ECS fields from alert index in UI.

vitaliidm avatar Jun 11 '24 14:06 vitaliidm

/ci

rylnd avatar Jun 18 '24 04:06 rylnd

/ci

rylnd avatar Jun 18 '24 20:06 rylnd

/ci

rylnd avatar Jun 18 '24 23:06 rylnd

/ci

rylnd avatar Jun 19 '24 00:06 rylnd

/ci

rylnd avatar Jun 21 '24 02:06 rylnd

I'm now reopening this for review, after the brief diversion of rylnd/kibana#9. The PR description has been updated to reflect those changes, and review steps now include a new demo environment.

There are currently 2-3 failing cypress tests relating to ML rule creation/edit. While they are sporadic and seem related to the test environment's ML jobs, they may also indicate a UI bug. I'm exploring all possibilities, but reviewers can either focus on the backend implementation or point an eye toward reproducing those failures. Thanks!

rylnd avatar Jun 21 '24 04:06 rylnd

Moving this back into draft for now; apologies for the noise. Please see this slack thread for more details on status.

rylnd avatar Jun 21 '24 22:06 rylnd

/ci

rylnd avatar Jun 24 '24 20:06 rylnd

Regarding merging to main and release to serverless, it's correct that we should wait to merge and release this https://github.com/elastic/project-controller/pull/945 ?

@nkhristinin my understanding is that merging/releasing this PR would be the "intermediate" release, where the schema is changed and the feature is technically supported, but is hidden behind a feature flag. We'd need those role changes merged before the subsequent "final" release, where we enable the feature via feature flag.

So: this PR and the roles one can be merged independently, so long as they both happen before that hypothetical "final" release 👍

rylnd avatar Jun 27 '24 20:06 rylnd

@rylnd, can you also have a look at a bunch of comments from a previous review? Thanks

vitaliidm avatar Jun 28 '24 10:06 vitaliidm

:yellow_heart: Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
securitySolution 5582 5585 +3

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
securitySolution 15.6MB 15.6MB +3.3KB

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
securitySolution 83.7KB 83.8KB +68.0B
Unknown metric groups

References to deprecated APIs

id before after diff
securitySolution 571 574 +3

History

  • :broken_heart: Build #218877 failed 19245967d25acbb5e39aecf5bbe2b9ef9bee5598
  • :broken_heart: Build #218588 failed d666eb03328d5a4ac01c9f97bcf6bac5dcca3874
  • :broken_heart: Build #218583 failed d152289331954a43ef76a10acf5ea053b082aa39
  • :yellow_heart: Build #218513 was flaky 6ce66d14c107a562c7c177a3e143f80f74730f11
  • :yellow_heart: Build #218338 was flaky 307e9644b439bdc71a0260bdd222c09939eb9dcf

To update your PR or re-run it, just comment with: @elasticmachine merge upstream

cc @rylnd

kibana-ci avatar Jul 01 '24 21:07 kibana-ci

@rylnd

I also think that we should disable duration and missing fields checkboxes, when suppression fields controller is disabled. Otherwise user can still edit it and save rule with new configuration

https://github.com/elastic/kibana/assets/92328789/1e58d9d4-e2f0-451c-88c6-ced6f349770e

vitaliidm avatar Jul 02 '24 15:07 vitaliidm