alerting [BUG] {{period_start}} and {{period

[BUG] {{period_start}} and {{period_end}} seems not respected

Open jngbng opened this issue 3 years ago • 1 comments

Describe the bug To receive error log messages, I created an alert using below query and run it on every minutes as mentioned in #187.

Extraction query:

{
    "size": 10,
    "query": {
        "bool": {
            "filter": [
                {
                    "range": {
                        "@timestamp": {
                            "from": "{{period_start}}||-1m",
                            "to": "{{period_end}}",
                            "include_lower": true,
                            "include_upper": true,
                            "format": "epoch_millis",
                            "boost": 1
                        }
                    }
                },
                {
                    "match": {
                        "log": {
                            "query": "error",
                            "operator": "OR",
                            "prefix_length": 0,
                            "max_expansions": 50,
                            "fuzzy_transpositions": true,
                            "lenient": false,
                            "zero_terms_query": "NONE",
                            "auto_generate_synonyms_phrase_query": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "aggregations": {}
}

Trigger message template:

Period: {{ctx.periodStart}} - {{ctx.periodEnd}}

{{#ctx.results}}
Total hits: {{hits.total.value}} ({{hits.hits.length}}{{hits.hits.size}} shown)

{{#hits.hits}}
{{#_source.log_processed}}
{{_source.@timestamp}} [{{level}}] {{logger}} - {{message}}
{{/_source.log_processed}}
{{/hits.hits}}
{{/ctx.results}}

Question1:

It seems work well, but I got two consecutive emails for one error log.

At 2021-08-17T18:52:

Period: 2021-08-17T18:51:30.629Z - 2021-08-17T18:52:30.629Z

Total hits: 1 (1 shown)

2021-08-17T18:52:22.447Z [ERROR] blah blah

And at 2021-08-17T18:53:

Period: 2021-08-17T18:52:30.629Z - 2021-08-17T18:53:30.629Z

Total hits: 1 (1 shown)

2021-08-17T18:52:22.447Z [ERROR] blah blah

The second email is weird as the matched result is out of time range (2021-08-17T18:52:30.629Z - 2021-08-17T18:53:30.629Z) and is already captured by the first email.

Question2:

As far as I know, {{period_end}} is set to the time when the monitor runs and {{period_start}} is evaluated by{{period_end}} minus interval.

Is it possible that an event is not captured by neither or two consecutive monitor runs?

For example, one monitor runs with period (2021-08-17T18:51:30.629Z - 2021-08-17T18:52:30.629Z) and the next period runs with period (2021-08-17T18:52:30.929Z - 2021-08-17T18:53:30.929Z) so that an event occurred at 2021-08-17T18:52:30.800Z is not captured by neither of monitors. Can this scenario happen?

Other plugins installed No.

To Reproduce As explained in Bug section.

Expected behavior Each monitor trigger's time range should not overlap.

Screenshots If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

OS: macOS 11.2.3
Browser: chrome
Version: 91.0.4472.114

Additional context Add any other context about the problem here.

Aug 18 '21 03:08 jngbng

Edit on 2021/09/03:

I found that some events are not captured by monitor at all. I seems that current behavior might work like below.

Trigger time	period variable (`period_start/end`)	actual query period
09-03 00:01	09-03 00:00 - 00:01	09-03 00:00 - 00:01
09-03 00:02	09-03 00:01 - 00:02	09-03 00:00 - 00:01 <- causes trigger miss or duplicate trigger
09-03 00:03	09-03 00:02 - 00:03	09-03 00:02 - 00:03

Detailed situation:

Whenever network error (such as time out) happens in my app server, both [Error log] and [Access log] are printed.

Elastic data:

2021-09-03 10:30:15 [Error log] Actual error message 2021-09-03 10:30:15 [Access log] Web server access log whose status-code is 500

Two Monitors are set:

Monitor for error log
Monitor for 5xx status access log

Actual trigger: For network error events, both monitors should trigger. But sometimes monitor 2 is not triggered whereas monitor 1 does. :( The query setting seems right because from Edit Monitor page, I replace 'period_start' and 'period_end' with missed period and run, then log data are corrected read.

Sep 03 '21 01:09 jngbng

alerting alerting copied to clipboard

[BUG] {{period_start}} and {{period_end}} seems not respected

Detailed situation:

alerting
alerting copied to clipboard