alerting icon indicating copy to clipboard operation
alerting copied to clipboard

[BUG] {{period_start}} and {{period_end}} seems not respected

Open jngbng opened this issue 3 years ago • 1 comments

Describe the bug To receive error log messages, I created an alert using below query and run it on every minutes as mentioned in #187.

image

Extraction query:

{
    "size": 10,
    "query": {
        "bool": {
            "filter": [
                {
                    "range": {
                        "@timestamp": {
                            "from": "{{period_start}}||-1m",
                            "to": "{{period_end}}",
                            "include_lower": true,
                            "include_upper": true,
                            "format": "epoch_millis",
                            "boost": 1
                        }
                    }
                },
                {
                    "match": {
                        "log": {
                            "query": "error",
                            "operator": "OR",
                            "prefix_length": 0,
                            "max_expansions": 50,
                            "fuzzy_transpositions": true,
                            "lenient": false,
                            "zero_terms_query": "NONE",
                            "auto_generate_synonyms_phrase_query": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "aggregations": {}
}

Trigger message template:

Period: {{ctx.periodStart}} - {{ctx.periodEnd}}

{{#ctx.results}}
Total hits: {{hits.total.value}} ({{hits.hits.length}}{{hits.hits.size}} shown)

{{#hits.hits}}
{{#_source.log_processed}}
{{_source.@timestamp}} [{{level}}] {{logger}} - {{message}}
{{/_source.log_processed}}
{{/hits.hits}}
{{/ctx.results}}

Question1:

It seems work well, but I got two consecutive emails for one error log.

At 2021-08-17T18:52:

Period: 2021-08-17T18:51:30.629Z - 2021-08-17T18:52:30.629Z

Total hits: 1 (1 shown)

2021-08-17T18:52:22.447Z [ERROR] blah blah

And at 2021-08-17T18:53:

Period: 2021-08-17T18:52:30.629Z - 2021-08-17T18:53:30.629Z

Total hits: 1 (1 shown)

2021-08-17T18:52:22.447Z [ERROR] blah blah

The second email is weird as the matched result is out of time range (2021-08-17T18:52:30.629Z - 2021-08-17T18:53:30.629Z) and is already captured by the first email.

Question2:

As far as I know, {{period_end}} is set to the time when the monitor runs and {{period_start}} is evaluated by{{period_end}} minus interval.

Is it possible that an event is not captured by neither or two consecutive monitor runs?

For example, one monitor runs with period (2021-08-17T18:51:30.629Z - 2021-08-17T18:52:30.629Z) and the next period runs with period (2021-08-17T18:52:30.929Z - 2021-08-17T18:53:30.929Z) so that an event occurred at 2021-08-17T18:52:30.800Z is not captured by neither of monitors. Can this scenario happen?

Other plugins installed No.

To Reproduce As explained in Bug section.

Expected behavior Each monitor trigger's time range should not overlap.

Screenshots If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: macOS 11.2.3
  • Browser: chrome
  • Version: 91.0.4472.114

Additional context Add any other context about the problem here.

jngbng avatar Aug 18 '21 03:08 jngbng

Edit on 2021/09/03:

I found that some events are not captured by monitor at all. I seems that current behavior might work like below.

Trigger time period variable (period_start/end) actual query period
09-03 00:01 09-03 00:00 - 00:01 09-03 00:00 - 00:01
09-03 00:02 09-03 00:01 - 00:02 09-03 00:00 - 00:01 <- causes trigger miss or duplicate trigger
09-03 00:03 09-03 00:02 - 00:03 09-03 00:02 - 00:03

Detailed situation:

Whenever network error (such as time out) happens in my app server, both [Error log] and [Access log] are printed.

Elastic data:

2021-09-03 10:30:15 [Error log] Actual error message 2021-09-03 10:30:15 [Access log] Web server access log whose status-code is 500

Two Monitors are set:

  1. Monitor for error log
  2. Monitor for 5xx status access log

Actual trigger: For network error events, both monitors should trigger. But sometimes monitor 2 is not triggered whereas monitor 1 does. :( The query setting seems right because from Edit Monitor page, I replace 'period_start' and 'period_end' with missed period and run, then log data are corrected read.

jngbng avatar Sep 03 '21 01:09 jngbng