[Feature Request] Alerting support for metric series
Use Case
Currently, HT supports the following out-of-the-box time series metrics for service and API entities, based on tracing data:
- call rate
- error rate
- latency
There are alerting use cases that help with RCA, MTTD, detecting service degradation, etc. For example:
- Use case 1: As a service/API owner, I want to get an alert whenever there is a sudden spike/sharp increase in the latency of any operation/API calls from or to my service.
- Use case 2: As a service/API owner, I want to get an alert whenever there is a sudden spike in traffic/increase in the call rate to my service/API.
- Use case 3: As a service/API owner, I want to get an alert whenever there is a sudden spike in errors/increase in the error rate for my service.
As part of this feature request, can we enhance HT with alerting capabilities on time series metrics?
Proposal
- Have a way to configure alerts
- Have a way to notify via Slack, email, or a webhook (see the notification sketch after this list)
- Incorporate an evaluation engine for the configured alerts
- Have a way to list all the configured alerts
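For the notification item above, here is a minimal sketch of a webhook-based notifier; the class name, URL, and payload shape are placeholders, and Slack/email integrations would follow the same pattern:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WebhookNotifier {
  private final HttpClient client = HttpClient.newHttpClient();

  // posts a JSON payload to the configured webhook URL (e.g. a Slack incoming webhook)
  public int notify(String webhookUrl, String jsonPayload) throws Exception {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(webhookUrl))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(jsonPayload))
        .build();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    return response.statusCode();
  }

  public static void main(String[] args) throws Exception {
    // URL and payload are placeholders, not real endpoints
    String payload = "{\"text\": \"High_Avg_Latency alert fired for service 1234\"}";
    int status = new WebhookNotifier().notify("https://hooks.example.com/alerts", payload);
    System.out.println("Webhook responded with HTTP " + status);
  }
}
```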
Work items:
Phase 1:
- [x] #255
- [x] #256
- [x] #257
- [x] #258
- [x] #259
- [x] #264
- [x] #261
- [x] #271
- [x] #272
- [x] #260
- [x] #266
- [x] #273
- [x] #262
- [x] #267
Phase 2:
- [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/55
- [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/52
- [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/65
- [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/40
- [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/64
- [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/42
Phase 3:
- [x] #300
- [x] #301
- [x] #303
- [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/54
- [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/76
- [x] https://github.com/hypertrace/hypertrace-alert-engine/issues/77
- [ ] #318
Backlogs and Enhancements:
- [ ] https://github.com/hypertrace/hypertrace-alert-engine/issues/73
- [ ] https://github.com/hypertrace/hypertrace-alert-engine/issues/53
- [ ] https://github.com/hypertrace/hypertrace-alert-engine/issues/43
- [ ] #319
- [ ] Provide a generic queue interface to plug in in-memory or other queues besides Kafka
Status: Phase 3 is in progress.
As part of this, we can go ahead with an anomaly-based alerting rule definition, since it can cover basic thresholding scenarios as well. As an example, an alert rule definition can have the following four components:
- metric selection definition (e.g., the metric attribute, its scope, aggregation function, granularity, etc.)
- baseline calculation definition (e.g., based on historical data such as the past 12 hours, or a static threshold)
- bounding box (defines the bounds for deviating from the baseline for the warning and critical conditions)
- duration (longevity) condition (helps express conditions like "the condition has been violated for the last 5 minutes")
As an example, suppose we have to define the following alert:
Alert me if the average latency at 1-minute granularity for service (1234) deviates 2x from the baseline in both (critical and warning) bounds, where the baseline uses 1 day of data and the AVG function, and the violation lasts for more than 5 minutes.
For the above, we can have a rule definition as below.
    Alerts:
      - name: High_Avg_Latency
        # define metric selection
        metric_selection:
          metric: duration
          scope: SERVICE
          function: AVG
          filters:
            - operator: EQ
              attribute: id
              value:
                string: "1234"
          granularity:
            size: 1
            unit: min
        # define baseline
        baseline:
          type: DYNAMIC
          period:
            size: 1
            unit: day
          function: AVG
        # define bounding box
        bounding_box:
          upper_bound:
            critical: 2
            warning: 2
          lower_bound:
            critical: -2
            warning: -2
        # longevity condition
        duration:
          size: 5
          unit: min
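For illustration, the rule above could be represented in code with a data model roughly like the following sketch; the class and field names simply mirror the YAML keys and are assumptions, not the actual Hypertrace configuration API:

```java
import java.util.List;

// Hypothetical data model mirroring the YAML rule definition above.
public class AlertRule {
  public record Granularity(int size, String unit) {}
  public record Filter(String operator, String attribute, String value) {}

  public record MetricSelection(
      String metric, String scope, String function,
      List<Filter> filters, Granularity granularity) {}

  public record Baseline(String type, Granularity period, String function) {}

  // warning/critical deviation factors from the baseline
  public record Bound(double critical, double warning) {}
  public record BoundingBox(Bound upperBound, Bound lowerBound) {}

  public record Rule(
      String name,
      MetricSelection metricSelection,
      Baseline baseline,
      BoundingBox boundingBox,
      Granularity duration) {}

  public static void main(String[] args) {
    // the High_Avg_Latency example expressed with this model
    Rule rule = new Rule(
        "High_Avg_Latency",
        new MetricSelection(
            "duration", "SERVICE", "AVG",
            List.of(new Filter("EQ", "id", "1234")),
            new Granularity(1, "min")),
        new Baseline("DYNAMIC", new Granularity(1, "day"), "AVG"),
        new BoundingBox(new Bound(2, 2), new Bound(-2, -2)),
        new Granularity(5, "min"));
    System.out.println(rule);
  }
}
```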
For evaluation, we can think of two options:
- periodic rule evaluation by fetching data via queries
- evaluation in our streaming pipeline using the input StructuredTrace
Option 1: periodic rule evaluation by fetching data via queries

In this option, we will have a job that evaluates all the allocated rules periodically. At a high level, we will have:

    alert-evaluator-job:
      // runs the loop below every 1 min;
      // the rule list is refreshed each iteration so CRUD changes take effect
      eval_loop:
        for each rule in all_rules:
          - fetch all metric data points for the current minute of the defined metric by query
          - fetch all metric points needed for the baseline calculation by query (if needed)
          - optimize these steps with a cache and a sliding window
          - calculate the bounding box condition using the baseline data
          - evaluate all the current metric data points against the bounding box
          - update the duration (longevity) state, and if it is met, raise an alert
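A minimal sketch of what such a periodic evaluator loop could look like, assuming a hypothetical query client and interpreting the upper critical bound of 2 as "more than 2x the baseline" (the types, method names, and bound interpretation are all illustrative assumptions, not the actual implementation):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AlertEvaluatorJob {
  // hypothetical query interface over the metric store
  interface MetricQueryClient {
    List<Double> fetchCurrentWindow(String rule);   // points for the current minute
    List<Double> fetchBaselineWindow(String rule);  // points for the baseline period
  }

  private final MetricQueryClient client;
  // per-rule count of consecutive violating evaluations (duration/longevity state)
  private final Map<String, Integer> violationStreak = new ConcurrentHashMap<>();

  AlertEvaluatorJob(MetricQueryClient client) {
    this.client = client;
  }

  void start(List<String> rules) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // run the evaluation loop every minute over all (refreshed) rules
    scheduler.scheduleAtFixedRate(() -> rules.forEach(this::evaluate), 0, 1, TimeUnit.MINUTES);
  }

  void evaluate(String rule) {
    List<Double> current = client.fetchCurrentWindow(rule);
    List<Double> history = client.fetchBaselineWindow(rule);

    // baseline: AVG over the baseline window (per the DYNAMIC/AVG example above)
    double baseline = history.stream().mapToDouble(Double::doubleValue).average().orElse(0);

    // bounding box: assumed here to mean "more than 2x the baseline" is a critical violation
    double upperCritical = baseline * 2;
    boolean violated = current.stream().anyMatch(v -> v > upperCritical);

    // duration (longevity) state: count consecutive violating 1-minute evaluations
    int streak = violated ? violationStreak.merge(rule, 1, Integer::sum) : 0;
    if (!violated) violationStreak.put(rule, 0);

    if (streak >= 5) {
      System.out.printf("ALERT %s: value above %.2f for %d minutes%n", rule, upperCritical, streak);
    }
  }
}
```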
Option 2: evaluation in our streaming pipeline using the input StructuredTrace

In this option, we will have a streaming job that evaluates alert rules using the incoming stream data. At a high level, the job will have two stream node processors:
- the first stream node extracts the raw metric points from the incoming trace data (StructuredTrace) for the required rules
- the second stream node works on this metric stream to evaluate the alert conditions
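A minimal sketch of such a two-node topology using Kafka Streams; the topic names, the stand-in Trace/MetricPoint types, and the extraction/evaluation logic are assumptions for illustration, while the real job would consume StructuredTrace and apply the full rule definition:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class AlertingTopology {
  // hypothetical stand-ins for StructuredTrace and the extracted metric point
  record Trace(String serviceId, double durationMillis) {}
  record MetricPoint(String serviceId, double value) {}

  public StreamsBuilder buildTopology() {
    StreamsBuilder builder = new StreamsBuilder();

    // node 1: extract raw metric points from the incoming trace stream
    KStream<String, Trace> traces = builder.stream("structured-traces"); // topic name is an assumption
    KStream<String, MetricPoint> metrics =
        traces.mapValues(t -> new MetricPoint(t.serviceId(), t.durationMillis()));

    // node 2: evaluate alert conditions on the metric stream
    metrics
        .filter((key, point) -> violatesRule(point))
        .mapValues(point -> "ALERT for service " + point.serviceId())
        .to("alert-notifications"); // topic name is an assumption

    return builder;
  }

  private boolean violatesRule(MetricPoint point) {
    // placeholder: the real evaluation would apply the baseline/bounding-box/duration logic per rule
    return point.value() > 1000;
  }
}
```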
Adding the work item lists to the description above; the associated tickets have further details if individual items need them.
@kotharironak I am guessing this will be configurable from the Hypertrace UI, or do we have to create a separate rule definition for the alert?
@subintp In the first phase, I was thinking of the second option. We were considering either GraphQL or the config service's API for CRUD on alert rules. So, you will be able to define all the alerts in an alert config file for the API or service entities of interest, and we will have a simple script to deploy them into the system.
As discussed offline with @vv, there is a need to support metadata with an alert rule, and also to have a suppression (evaluation) window, for example:

    alert_evaluation_time:
      window: 5am - 5pm
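A minimal sketch of how such an evaluation window could be enforced before a rule is evaluated; the window representation and class name are assumptions:

```java
import java.time.LocalTime;

public class EvaluationWindow {
  private final LocalTime start;
  private final LocalTime end;

  public EvaluationWindow(LocalTime start, LocalTime end) {
    this.start = start;
    this.end = end;
  }

  // returns true if the rule should be evaluated at the given time,
  // e.g. a "5am - 5pm" window only evaluates (and alerts) during that interval
  public boolean shouldEvaluate(LocalTime now) {
    return !now.isBefore(start) && now.isBefore(end);
  }

  public static void main(String[] args) {
    EvaluationWindow window = new EvaluationWindow(LocalTime.of(5, 0), LocalTime.of(17, 0));
    System.out.println(window.shouldEvaluate(LocalTime.of(9, 30)));  // true
    System.out.println(window.shouldEvaluate(LocalTime.of(22, 0)));  // false
  }
}
```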