alerting
alerting copied to clipboard
[BUG] {{period_start}} and {{period_end}} seems not respected
Describe the bug To receive error log messages, I created an alert using below query and run it on every minutes as mentioned in #187.
Extraction query:
{
"size": 10,
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"from": "{{period_start}}||-1m",
"to": "{{period_end}}",
"include_lower": true,
"include_upper": true,
"format": "epoch_millis",
"boost": 1
}
}
},
{
"match": {
"log": {
"query": "error",
"operator": "OR",
"prefix_length": 0,
"max_expansions": 50,
"fuzzy_transpositions": true,
"lenient": false,
"zero_terms_query": "NONE",
"auto_generate_synonyms_phrase_query": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
},
"aggregations": {}
}
Trigger message template:
Period: {{ctx.periodStart}} - {{ctx.periodEnd}}
{{#ctx.results}}
Total hits: {{hits.total.value}} ({{hits.hits.length}}{{hits.hits.size}} shown)
{{#hits.hits}}
{{#_source.log_processed}}
{{_source.@timestamp}} [{{level}}] {{logger}} - {{message}}
{{/_source.log_processed}}
{{/hits.hits}}
{{/ctx.results}}
Question1:
It seems work well, but I got two consecutive emails for one error log.
At 2021-08-17T18:52:
Period: 2021-08-17T18:51:30.629Z - 2021-08-17T18:52:30.629Z
Total hits: 1 (1 shown)
2021-08-17T18:52:22.447Z [ERROR] blah blah
And at 2021-08-17T18:53:
Period: 2021-08-17T18:52:30.629Z - 2021-08-17T18:53:30.629Z
Total hits: 1 (1 shown)
2021-08-17T18:52:22.447Z [ERROR] blah blah
The second email is weird as the matched result is out of time range (2021-08-17T18:52:30.629Z - 2021-08-17T18:53:30.629Z) and is already captured by the first email.
Question2:
As far as I know, {{period_end}}
is set to the time when the monitor runs and {{period_start}}
is evaluated by{{period_end}}
minus interval.
Is it possible that an event is not captured by neither or two consecutive monitor runs?
For example, one monitor runs with period (2021-08-17T18:51:30.629Z - 2021-08-17T18:52:30.629Z) and the next period runs with period (2021-08-17T18:52:30.929Z - 2021-08-17T18:53:30.929Z) so that an event occurred at 2021-08-17T18:52:30.800Z is not captured by neither of monitors. Can this scenario happen?
Other plugins installed No.
To Reproduce
As explained in Bug
section.
Expected behavior Each monitor trigger's time range should not overlap.
Screenshots If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
- OS: macOS 11.2.3
- Browser: chrome
- Version: 91.0.4472.114
Additional context Add any other context about the problem here.
Edit on 2021/09/03:
I found that some events are not captured by monitor at all. I seems that current behavior might work like below.
Trigger time | period variable (period_start/end ) |
actual query period |
---|---|---|
09-03 00:01 | 09-03 00:00 - 00:01 | 09-03 00:00 - 00:01 |
09-03 00:02 | 09-03 00:01 - 00:02 | 09-03 00:00 - 00:01 <- causes trigger miss or duplicate trigger |
09-03 00:03 | 09-03 00:02 - 00:03 | 09-03 00:02 - 00:03 |
Detailed situation:
Whenever network error (such as time out) happens in my app server, both [Error log] and [Access log] are printed.
Elastic data:
2021-09-03 10:30:15 [Error log] Actual error message 2021-09-03 10:30:15 [Access log] Web server access log whose status-code is 500
Two Monitors are set:
- Monitor for error log
- Monitor for 5xx status access log
Actual trigger:
For network error events, both monitors should trigger. But sometimes monitor 2 is not triggered whereas monitor 1 does. :(
The query setting seems right because from Edit Monitor
page, I replace 'period_start' and 'period_end' with missed period and run, then log data are corrected read.