
[Feature Request] Alerting support for metric series

Open kotharironak opened this issue 3 years ago • 5 comments

Use Case

Currently, HT supports the following out-of-the-box time-series metrics for service and API entities, based on tracing data:

  • call rate
  • error rate
  • latency

There are alerting use cases that help with RCA, MTTD, detecting service degradation, etc. As an example:

  • Use case 1: As a service/API owner, I want to get an alert whenever there is a sudden spike/sharp increase in the latency of any operation/API calls from or to my service.
  • Use case 2: As a service/API owner, I want to get an alert whenever there is a sudden spike in traffic/increase in the call rate to my service/API.
  • Use case 3: As a service/API owner, I want to get an alert whenever there is a sudden spike in errors/increase in the error rate for my service.

As part of this feature request, can we enhance HT with alerting capabilities on these time-series metrics?

Proposal

  • Have a way to configure alerts
  • Have a way to notify via Slack, email, or a webhook
  • Incorporate an evaluation engine for alert configurations
  • Have a way to list all the configured alerts (a rough sketch of these capabilities follows this list)
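As a rough sketch, these capabilities could map to a handful of small interfaces. All names below are hypothetical placeholders for illustration, not existing Hypertrace classes.

import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the proposed alerting components; names are illustrative only.
public final class AlertingComponentsSketch {

  record AlertRule(String name, String metric, String scope) {}
  record AlertEvent(String ruleName, String message) {}
  record MetricSeries(List<Double> points) {}

  // "Have a way to configure alerts" / "list all the configured alerts"
  interface AlertRuleStore {
    AlertRule create(AlertRule rule);
    List<AlertRule> listAll();
  }

  // "Have a way to notify via Slack, email, or a webhook"
  interface NotificationChannel {
    void send(AlertEvent event);
  }

  // "Incorporate an evaluation engine" -- evaluates a rule against a metric series
  interface AlertEvaluator {
    Optional<AlertEvent> evaluate(AlertRule rule, MetricSeries series);
  }
}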

Work items:

Phase 1:

  • [x] #255
  • [x] #256
  • [x] #257
  • [x] #258
  • [x] #259
  • [x] #264
  • [x] #261
  • [x] #271
  • [x] #272
  • [x] #260
  • [x] #266
  • [x] #273
    • [x] #262
    • [x] #267

Phase 2:

  • [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/55
  • [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/52
  • [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/65
  • [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/40
  • [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/64
  • [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/42

Phase 3:

  • [x] #300
  • [x] #301
  • [x] #303
  • [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/54
  • [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/76
  • [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/77
  • [ ] #318

Backlogs and Enhancements:

  • [ ] https://github.com/hypertrace/hypertrace-alert-engine/issues/73
  • [ ] https://github.com/hypertrace/hypertrace-alert-engine/issues/53
  • [ ] https://github.com/hypertrace/hypertrace-alert-engine/issues/43
  • [ ] #319
  • [ ] provide a generic queue interface to plug in in-memory or other queues besides Kafka

Status: Phase 3 is in progress.

kotharironak avatar Jun 07 '21 15:06 kotharironak

As part of this, we can go ahead with an anomaly-based alerting rule definition, since it can cover basic thresholding scenarios as well. As an example, we can have the following four components for an alert rule definition:

  1. metric selection definition (e.g. metric attribute, its scope, aggregation function, granularity, etc.)
  2. baseline calculation definition (e.g. based on historical data from the past 12 hrs, or a static threshold)
  3. bounding box (defines the bounds for deviating from the baseline for the warning and critical conditions)
  4. duration (longevity) condition (helps express conditions like "violating for the last 5 mins")

As an example, suppose we have to define the alert below:

Alert me if the avg latency at 1-min granularity for service (1234) deviates 2x from the baseline in both
(critical and warning) bounds, where the baseline uses 1 day of data and the avg aggregation function, for more than 5 min.

For the above, we can have a rule definition as below:

Alerts: 
  - name: High_Avg_Latency
    # define metric selection
    metric_selection:
      metric: duration
      scope: SERVICE
      function: AVG
      filters:
        - operator: EQ
          attribute: id
          value:
            string: "1234"
      granularity:
        size: 1
        unit: min
    # define baseline
    baseline:
      type: DYNAMIC
      period:
        size: 1
        unit: day
      function: AVG
    # define bounding box
    bounding_box:
      upper_bound:
        critical: 2
        warning: 2
      lower_bound:
        critical: -2
        warning: -2
    # longevity condition
    duration:
      size: 5
      unit: min
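As a minimal sketch of how such a rule could be evaluated: one plausible reading (an assumption, since the thread does not pin down the semantics of the bound values) is that the critical/warning values are multiples of the standard deviation of the baseline window, i.e. a point is critical when it lies more than 2 standard deviations above or below the baseline average.

import java.util.List;

// Sketch only: evaluates a single data point against a baseline-derived bounding box.
// Treating the bound values as standard-deviation multiples is an assumption of this
// sketch, not something the rule format above defines.
public final class BoundingBoxSketch {

  enum Severity { OK, WARNING, CRITICAL }

  static Severity evaluate(double value, List<Double> baselineWindow,
                           double upperCritical, double upperWarning,
                           double lowerCritical, double lowerWarning) {
    double mean = baselineWindow.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    double variance = baselineWindow.stream()
        .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0);
    double stdDev = Math.sqrt(variance);

    double deviation = stdDev == 0 ? 0 : (value - mean) / stdDev;  // signed deviation in sigmas
    if (deviation >= upperCritical || deviation <= lowerCritical) return Severity.CRITICAL;
    if (deviation >= upperWarning || deviation <= lowerWarning) return Severity.WARNING;
    return Severity.OK;
  }

  public static void main(String[] args) {
    // Baseline around 100 ms; a 260 ms point is far beyond +2 sigma, so it is CRITICAL.
    List<Double> baseline = List.of(95.0, 100.0, 105.0, 98.0, 102.0);
    System.out.println(evaluate(260.0, baseline, 2, 2, -2, -2));
  }
}

The duration (longevity) condition would then be tracked separately, e.g. the alert fires only after the bounding box has been violated for 5 consecutive minutes.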

For evaluation, we can think of two options.

  1. periodic rule evaluation by fetching data via queries
  2. evaluation in our streaming pipeline using the input StructuredTrace

Option 1: periodic rule evaluation by fetching data via queries

In this option, we will have a job that evaluates all the allocated rules periodically. At a high level, we will have:

alert-evaluator-job:
// runs the loop below every 1 min
// all rules are re-read so that CRUD changes take effect
eval_loop:
  for each rule in all_rules:
    - fetch the current minute's data points of the defined metric via a query
    - fetch the data points required for the baseline calculation via a query (if needed)
      - optimize these steps with a cache and a sliding window
    - calculate the bounding box condition using the baseline data
    - evaluate all current metric data points against the bounding box
      - update the duration (longevity) state and, if met, raise an alert
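
A minimal Java sketch of this loop, assuming hypothetical query helpers (fetchCurrentPoints, fetchBaselineWindow) and a placeholder bounding-box check; none of these names are real Hypertrace APIs.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of Option 1: a job that periodically re-evaluates all allocated rules.
public final class PeriodicAlertEvaluatorSketch {

  record AlertRule(String name) {}

  // Hypothetical query-service client; the real job would query the metric store.
  interface MetricQueryClient {
    List<Double> fetchCurrentPoints(AlertRule rule);    // data points for the current minute
    List<Double> fetchBaselineWindow(AlertRule rule);   // e.g. the past 1 day of data
  }

  private final MetricQueryClient queryClient;
  private final Map<String, Integer> violationMinutes = new ConcurrentHashMap<>();

  PeriodicAlertEvaluatorSketch(MetricQueryClient queryClient) {
    this.queryClient = queryClient;
  }

  void start(List<AlertRule> rules) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // Run the evaluation loop every minute; in practice the rule list would be
    // re-read each round so that CRUD changes take effect.
    scheduler.scheduleAtFixedRate(() -> rules.forEach(this::evaluate), 0, 1, TimeUnit.MINUTES);
  }

  private void evaluate(AlertRule rule) {
    List<Double> current = queryClient.fetchCurrentPoints(rule);
    List<Double> baseline = queryClient.fetchBaselineWindow(rule);  // cacheable / sliding window
    boolean violated = current.stream().anyMatch(v -> isOutsideBoundingBox(v, baseline));

    // Longevity condition: raise only after 5 consecutive violating minutes.
    int minutes = violated ? violationMinutes.merge(rule.name(), 1, Integer::sum) : 0;
    if (!violated) {
      violationMinutes.remove(rule.name());
    }
    if (minutes >= 5) {
      System.out.println("ALERT: " + rule.name());
    }
  }

  private boolean isOutsideBoundingBox(double value, List<Double> baseline) {
    // Placeholder check; the real evaluation would apply the rule's baseline and bounds.
    double mean = baseline.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    return value > 2 * mean || value < mean / 2;
  }
}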

Option 2: evaluation in our streaming pipeline using the input StructuredTrace

In this option, we will have a streaming job that evaluates the alert rules using the incoming stream data. At a high level, we will have two stream node processors as part of the job:

  • the first stream node extracts the raw metric points required by a rule from the incoming trace data (StructuredTrace)
  • the second stream node works on this metric stream to evaluate the alert condition
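
A rough Kafka Streams shape for this (a sketch only: the topic names, the MetricPoint type, and the extraction/evaluation logic are placeholders, not Hypertrace's actual streaming jobs):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// Sketch of Option 2 as a two-stage streaming topology.
public final class StreamingAlertEvaluatorSketch {

  record MetricPoint(String serviceId, String metric, double value, long timestampMillis) {}

  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    // First stream node: extract the raw metric points needed by the rules from the
    // incoming trace data (stubbed here as a String payload instead of StructuredTrace).
    KStream<String, String> traces = builder.stream("structured-traces");
    KStream<String, MetricPoint> metricPoints =
        traces.mapValues(StreamingAlertEvaluatorSketch::extractLatencyPoint);

    // Second stream node: evaluate the alert condition on the metric stream and
    // forward violations to a notification topic.
    metricPoints
        .filter((key, point) -> violatesBoundingBox(point))
        .mapValues(point -> "ALERT for service " + point.serviceId())
        .to("alert-notifications");

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "alert-evaluator-sketch");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    new KafkaStreams(builder.build(), props).start();
  }

  private static MetricPoint extractLatencyPoint(String trace) {
    // Placeholder extraction; the real job would parse spans out of a StructuredTrace.
    return new MetricPoint("1234", "duration", Double.parseDouble(trace), System.currentTimeMillis());
  }

  private static boolean violatesBoundingBox(MetricPoint point) {
    // Placeholder check; the real evaluation would apply the rule's baseline and bounds.
    return point.value() > 200.0;
  }
}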

kotharironak avatar Jun 08 '21 06:06 kotharironak

Adding the work-item lists to the description, along with associated tickets where individual items need further details.

kotharironak avatar Jun 08 '21 12:06 kotharironak

@kotharironak I am guessing this will be configurable from the Hypertrace UI, or will we have to create a separate rule definition for the alert?

subintp avatar Jun 08 '21 17:06 subintp

@subintp In the first phase, I was thinking of the second option. We were considering either GraphQL or the config service's API for CRUD on alert rules. So, you will be able to define all the alerts in an alert config file for the API or service entities you are interested in, and we will have a simple script to deploy them into the system.

kotharironak avatar Jun 10 '21 06:06 kotharironak

As discussed offline with @vv, there was a need to support metadata with an alert rule, and also to have a suppression (evaluation) window like:

alert_evaluation_time:
  window: 5am - 5pm

kotharironak avatar Jun 14 '21 04:06 kotharironak