sensu-plugins-pagerduty

An issue that was a warning that goes to critical doesn't send the resolution state

Open DaveWitchalls opened this issue 7 years ago • 19 comments

I think this is semi by design, but any views appreciated.

We've come across this during testing. Using CPU usage as an example:

If we set a warning threshold of 50%, an email gets sent. If it then goes to critical at 70%, the alert in PagerDuty is triggered. The problem for us is that if the CPU drops back below the critical threshold but stays above the warning threshold, the resolution isn't sent to PagerDuty because the incident is still live, even though it is no longer in a critical state.

Any thoughts or workarounds for this?

Thanks, Dave.

DaveWitchalls avatar Aug 10 '16 15:08 DaveWitchalls

Hmm I am seeing the same and need to spend some time looking into this. I wonder if maybe we can work around using event suppression rules?

majormoses avatar Mar 24 '17 17:03 majormoses

Taking a look, these are the only valid event actions: https://sensuapp.org/docs/0.25/reference/events.html#how-are-sensu-events-created

  • create
  • resolve
  • flapping

I honestly have not played much with flapping in Sensu; it might help with some of these situations, but not all. Note that, per the handler reference, handle_flapping defaults to false, so a handler has to enable it explicitly. https://sensuapp.org/docs/0.24/reference/handlers.html#handler-configuration

majormoses avatar Mar 24 '17 19:03 majormoses

I checked on the pagerduty side and event suppression can only be done on initial ingest.

majormoses avatar Mar 24 '17 22:03 majormoses

@eheydrick what are your thoughts? This is not really pagerduty specific nor can I think of a pagerduty specific work around.

majormoses avatar Mar 24 '17 22:03 majormoses

I checked with PagerDuty and event suppression will not serve as a workaround, unless we want to resolve on state change (I am torn on this). I think we might want to pose a generic question to the Sensu community and see what people think.

majormoses avatar Mar 27 '17 13:03 majormoses

Hi,

The workaround we had to go for was to include the resolved status in warning alerts. Far from ideal, but it does the job!

Dave.

DaveWitchalls avatar Mar 27 '17 13:03 DaveWitchalls

@DaveWitchalls did you apply this via a filter or mutator? I have not had any time to really look into this much and would like to try out your workaround and see if it works for us.

majormoses avatar Mar 27 '17 19:03 majormoses

Hi @majormoses

I asked the guy who did the work for me and got the below. It's reasonably long-winded, as it doesn't make much sense out of context. Hope it helps!

NOTES ABOUT OUR USE OF SENSU

  1. By default we only send alerts if the occurrences count reaches 5, however this is configurable using the "occurrences" check variable.
  2. By default we send reminder emails if the occurrences count is divisible by 20, again this is configurable using a "remind_every" custom check variable.
  3. PagerDuty handles its own reminders/escalations.
  4. We send all alerts (WARNING/CRITICAL/RESOLVED) via email.
  5. We only send CRITICAL alerts for specific checks via PagerDuty (as defined by the pagerduty_alert_filter).

Therefore, if a check's state changes from CRITICAL to WARNING, we use a mutator to generate a fake RESOLVE message that is sent to PagerDuty to clear the alert. In this situation PagerDuty will send a RESOLVED alert that shows the new status text as WARNING, not OK; an email will also be sent showing the status text as WARNING.

For checks which only require email alerts use the following handlers list: "handlers": ["default", "mail_alert_handler", "mail_recovery_handler", "mail_resolve_on_warning_handler"]

For checks which require both email and PagerDuty alerts (critical only) use the following handlers list: "handlers": ["default", "mail_alert_handler", "mail_recovery_handler", "mail_resolve_on_warning_handler", "pagerduty_alert_handler", "pagerduty_recovery_handler", "pagerduty_resolve_on_warning_handler"]

FILTER CONFIGURATION

/etc/sensu/conf.d/filters/alert_filters.json

NOTE the use of a custom key/value pair, remind_every, which is set up on the check. If it is not set, reminders default to being emailed every 20 occurrences.

{
  "filters": {
    "mail_alert_filter": {
      "negate": false,
      "attributes": {
        "action": "create",
        "occurrences": "eval: value == :::check.occurrences|5::: || value % :::check.remind_every|20::: == 0"
      }
    },
    "pagerduty_alert_filter": {
      "negate": false,
      "attributes": {
        "check": {
          "status": 2
        },
        "action": "create",
        "occurrences": "eval: value == :::check.occurrences|5:::"
      }
    }
  }
}

/etc/sensu/conf.d/filters/recovery_filters.json

{
  "filters": {
    "recovery_filter": {
      "negate": false,
      "attributes": {
        "action": "resolve",
        "occurrences": "eval: value >= :::check.occurrences|5:::"
      }
    },
    "resolve_on_warning_filter": {
      "negate": false,
      "attributes": {
        "check": {
          "status": 1
        },
        "action": "create",
        "occurrences": "eval: value == 1"
      }
    }
  }
}
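As a reading aid, the "occurrences" expressions in the filters above can be paraphrased as plain Ruby predicates. This is only a sketch: the method names are hypothetical, `threshold` stands in for `:::check.occurrences|5:::`, and `remind_every` stands in for `:::check.remind_every|20:::`. The "action" and "status" attributes of each filter still have to match separately.

```ruby
# Hypothetical helpers paraphrasing the filters' "occurrences" expressions.
# They are not part of Sensu; the defaults mirror the |5 and |20 fallbacks.

# mail_alert_filter: alert at the threshold, then remind every N occurrences.
def mail_alert?(occurrences, threshold: 5, remind_every: 20)
  occurrences == threshold || (occurrences % remind_every).zero?
end

# pagerduty_alert_filter: CRITICAL (status 2) only, exactly at the threshold.
def pagerduty_alert?(occurrences, status, threshold: 5)
  status == 2 && occurrences == threshold
end

# recovery_filter: only resolve incidents that were alerted on.
def recovery?(occurrences, threshold: 5)
  occurrences >= threshold
end

# Occurrence counts that would produce an email with the defaults:
(1..45).select { |n| mail_alert?(n) } # => [5, 20, 40]
```

With the defaults, a check that stays in an alert state sends its first email on the 5th occurrence and reminders on the 20th, 40th, and so on.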

HANDLER CONFIGURATION

/etc/sensu/conf.d/handlers/mail_handlers.json

{
  "handlers": {
    "mail_alert_handler": {
      "type": "pipe",
      "command": "handler-mailer.rb -s 'SENSU TEST'",
      "filter": "mail_alert_filter"
    },
    "mail_recovery_handler": {
      "type": "pipe",
      "command": "handler-mailer.rb -s 'SENSU TEST'",
      "filter": "recovery_filter"
    },
    "mail_resolve_on_warning_handler": {
      "type": "pipe",
      "command": "handler-mailer.rb -s 'SENSU TEST'",
      "filter": "resolve_on_warning_filter",
      "mutator": "mail_resolve_on_warning_mutator"
    },
    "mail_handler": {
      "type": "pipe",
      "command": "handler-mailer.rb -s 'SENSU TEST'"
    }
  },
  "mailer": {
    "admin_gui": "https://xxx.xxx.xxx.xxx/",
    "mail_from": "[email protected]",
    "mail_to": ["[email protected]"],
    "smtp_address": "127.0.0.1",
    "smtp_port": "25",
    "smtp_domain": "xxxxxxxxx.xxxxxxxxx"
  }
}

/etc/sensu/conf.d/handlers/pagerduty_handlers.json

{
  "handlers": {
    "pagerduty_alert_handler": {
      "type": "pipe",
      "command": "handler-pagerduty.rb",
      "filter": "pagerduty_alert_filter"
    },
    "pagerduty_recovery_handler": {
      "type": "pipe",
      "command": "handler-pagerduty.rb",
      "filter": "recovery_filter"
    },
    "pagerduty_resolve_on_warning_handler": {
      "type": "pipe",
      "command": "handler-pagerduty.rb",
      "filter": "resolve_on_warning_filter",
      "mutator": "pagerduty_resolve_on_warning_mutator"
    },
    "pagerduty_handler": {
      "type": "pipe",
      "command": "handler-pagerduty.rb"
    }
  },
  "pagerduty": {
    "api_key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  }
}

MUTATOR CONFIGURATION

/etc/sensu/conf.d/mutators/mail_mutators.json

{
  "mutators": {
    "mail_resolve_on_warning_mutator": {
      "command": "/etc/sensu/mutators/mutator-mail-resolve-on-warning.rb"
    }
  }
}

/etc/sensu/conf.d/mutators/pagerduty_mutators.json

{
  "mutators": {
    "pagerduty_priority_override_mutator": {
      "command": "mutator-pagerduty-priority-override.rb"
    },
    "pagerduty_resolve_on_warning_mutator": {
      "command": "/etc/sensu/mutators/mutator-pagerduty-resolve-on-warning.rb"
    }
  }
}

MUTATORS

/etc/sensu/mutators/mutator-mail-resolve-on-warning.rb

#!/usr/bin/env ruby

require 'json'

module Sensu
  module Mutator
    class Mail
      class ResolveOnWarning
        def execute(input = STDIN)
          event = JSON.parse(input.read, symbolize_names: true)
          occurrences = event[:check][:occurrences]
          history = event[:check][:history]
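          ### EDGE CASE CHECKS ###
          # Check if number of occurrences is > 20 (max history length) - if so change it to 20.
          occurrences = 20 if occurrences > 20
          # Exit if length of history < occurrences + 1 as it can't have been in an alert state beforehand.
          # A non-zero exit makes the mutator fail, so the event is not passed on to the handler.
          exit 1 if history.length < (occurrences + 1)
          ########################
          # History entries are check exit statuses stored as strings; "2" is CRITICAL.
          test_array = Array.new(occurrences, "2")
          # Compare the `occurrences` entries immediately before the newest result.
          if history[-(occurrences + 1), occurrences] == test_array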
            event[:action] = 'resolve'
            event[:check][:occurrences] = 1
            JSON.dump(event)
          else
            exit 1
          end
        end
      end
    end
  end
end

## Is called from Gem script. Program name is full path to this script
### __FILE__ is the initial script ran, which is
### /etc/sensu/mutators/mutator-mail-resolve-on-warning.rb
if $PROGRAM_NAME.include?(__FILE__.split('/').last)
  mutator = Sensu::Mutator::Mail::ResolveOnWarning.new
  puts mutator.execute
end

/etc/sensu/mutators/mutator-pagerduty-resolve-on-warning.rb

#!/usr/bin/env ruby

require 'json'

module Sensu
  module Mutator
    class PagerDuty
      class ResolveOnWarning
        def execute(input = STDIN)
          event = JSON.parse(input.read, symbolize_names: true)
          occurrences = event[:check][:occurrences]
          history = event[:check][:history]
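          ### EDGE CASE CHECKS ###
          # Check if number of occurrences is > 20 (max history length) - if so change it to 20.
          occurrences = 20 if occurrences > 20
          # Exit if length of history < occurrences + 1 as it can't have been in an alert state beforehand.
          # A non-zero exit makes the mutator fail, so the event is not passed on to the handler.
          exit 1 if history.length < (occurrences + 1)
          ########################
          # History entries are check exit statuses stored as strings; "2" is CRITICAL.
          test_array = Array.new(occurrences, "2")
          # Compare the `occurrences` entries immediately before the newest result.
          if history[-(occurrences + 1), occurrences] == test_array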
            event[:action] = 'resolve'
            event[:check][:occurrences] = 1
            JSON.dump(event)
          else
            exit 1
          end
        end
      end
    end
  end
end

## Is called from Gem script. Program name is full path to this script
### __FILE__ is the initial script ran, which is
### /etc/sensu/mutators/mutator-pagerduty-resolve-on-warning.rb
if $PROGRAM_NAME.include?(__FILE__.split('/').last)
  mutator = Sensu::Mutator::PagerDuty::ResolveOnWarning.new
  puts mutator.execute
end
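
To make the history comparison in the two mutators easier to follow, here is a small standalone sketch of the same test. The method name and the sample histories are made up for illustration, and it assumes (as the mutators do) that the history holds exit statuses as strings with the newest result last.

```ruby
# Hypothetical standalone version of the history test used by both mutators.
# history:     check exit statuses as strings, newest last (assumed ordering).
# occurrences: occurrence count of the current WARNING event.
def previously_critical?(history, occurrences)
  occurrences = 20 if occurrences > 20
  # Too little history to contain an earlier alert state.
  return false if history.length < occurrences + 1

  # Look at the entries just before the newest result; all must be "2".
  history[-(occurrences + 1), occurrences] == Array.new(occurrences, '2')
end

previously_critical?(%w[0 1 2 2 1], 1) # => true  (CRITICAL right before the WARNING)
previously_critical?(%w[0 0 1 1], 1)   # => false (previous result was WARNING)
previously_critical?(%w[1], 1)         # => false (no earlier result at all)
```

Because resolve_on_warning_filter only passes events with an occurrence count of 1, the comparison in practice checks a single entry: the result immediately before the first WARNING.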

DaveWitchalls avatar Mar 28 '17 15:03 DaveWitchalls

@DaveWitchalls thanks for the info I will take a look and see if I can find some bastard amalgamation based on yours that works for us.

majormoses avatar Mar 28 '17 18:03 majormoses

Interesting, though considering this I am not sure it's worth the effort for me right now to use a filter, now that occurrences is an extension: https://github.com/sensu-extensions/sensu-extensions-occurrences

majormoses avatar Mar 28 '17 21:03 majormoses

Sensu events now have an "occurrences_watermark"; the Sensu built-in "occurrences" filter now uses it instead of "occurrences" for the purpose of the resolve action. These changes are in the Sensu Core 0.29 release.
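
A rough sketch of why the watermark helps with resolve events, under stated assumptions: this is a simplified illustration and not the extension's actual source, the method name is hypothetical, and it assumes the watermark records the highest occurrence count reached while the incident was open.

```ruby
# Simplified illustration only, NOT the code of sensu-extensions-occurrences.
# Returns true when the event should be dropped (filtered out).
def filtered_by_occurrences?(event, threshold)
  if event[:action] == :resolve
    # Judge a resolve by the peak count, so it is still delivered even if
    # the current occurrence count was reset by a state change.
    event[:occurrences_watermark] < threshold
  else
    event[:occurrences] < threshold
  end
end

# Made-up event: the count was reset to 1, but the incident peaked at 7.
resolve = { action: :resolve, occurrences: 1, occurrences_watermark: 7 }
filtered_by_occurrences?(resolve, 5) # => false, the resolve goes through
```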

portertech avatar Jul 11 '17 15:07 portertech

ok, cool we can have a reasonable path forward when 0.29 is supported by this plugin...

majormoses avatar Jul 11 '17 16:07 majormoses

> ok, cool we can have a reasonable path forward when 0.29 is supported by this plugin...

@majormoses Any thoughts on whether/when that might happen? (for the record, Sensu skipped from 0.29 to 1.0.0 and subsequently 1.0.2 in July)

kshep avatar Sep 26 '17 15:09 kshep

> @majormoses Any thoughts on whether/when that might happen? (for the record, Sensu skipped from 0.29 to 1.0.0 and subsequently 1.0.2 in July)

When someone is motivated enough and has the time to work on it. There are ~200 plugins and realistically 2 active maintainers (neither of us working for sensu) and we rely mostly on other community members to contribute.

Now beyond my boilerplate :man_shrugging: answer of when...I don't think it would be too hard taking a quick look at the plugin.

majormoses avatar Sep 26 '17 16:09 majormoses

Does this plugin not work with Sensu >= 0.29? Or is there just work required to support changes related specifically to this issue?

joshbenner avatar Jan 07 '18 23:01 joshbenner

This plugin does work with all recent versions of sensu, I am currently running sensu 1.1.1 and do not have issues. The comment referenced is regarding a work around for moving from warning -> critical -> warning. The idea was to auto resolve the incident and create a new one based on occurrences_watermark. This is still a hack and I do see that pagerduty does potentially have a better solution now: https://www.pagerduty.com/blog/dynamic-notifications but I have not looked into that enough to know what the limitations are.

majormoses avatar Jan 08 '18 17:01 majormoses

Hello from the PagerDuty Team! Following up to confirm this is a limitation in the PagerDuty API rather than in the Sensu integration itself. At the moment, incidents are immutable so the parent incident can't be updated when the severity changes. You can click into the newest alert itself to get the latest data.

I've submitted a feature request from the maintainer of this integration so our product team knows mutable incidents are important to our customers. If you have any questions, please feel free to reach out to [email protected].

ashleyabrooks avatar Jan 11 '18 20:01 ashleyabrooks

Thanks for confirming this.

majormoses avatar Jan 11 '18 21:01 majormoses

I recently did some testing with this issue, and it looks like PagerDuty has updated their API to allow escalating an incident from warning/low urgency to critical/high urgency if the "Dynamic notifications based on alert severity" option is used in the Assign and Notify configuration for a service.

For testing, we triggered an alert that started as a warning, which opened a low-urgency incident in PagerDuty. I then changed the alert to the critical threshold, which escalated the incident to high urgency in PagerDuty. One thing to note is that it will not de-escalate back to a low-urgency incident.
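
For anyone wanting to experiment with this, below is a minimal sketch of how a Sensu event could be mapped to a PagerDuty Events API v2 request body that carries a severity. It only builds the payload hash and sends nothing. It is not how handler-pagerduty.rb works; the method name, the status-to-severity mapping, the dedup key format, and the sample event are all assumptions for illustration.

```ruby
require 'json'

# Assumed mapping from Sensu check exit status to an Events API v2 severity.
SEVERITY_BY_STATUS = { 1 => 'warning', 2 => 'critical' }.freeze

# Hypothetical helper: builds (but does not send) an Events API v2 body.
def pagerduty_event(event, routing_key)
  check = event[:check]
  action = event[:action].to_s == 'resolve' ? 'resolve' : 'trigger'
  body = {
    routing_key: routing_key,
    event_action: action,
    # Same key for WARNING and CRITICAL so both land on one incident.
    dedup_key: "#{event[:client][:name]}/#{check[:name]}"
  }
  if action == 'trigger'
    body[:payload] = {
      summary: check[:output].to_s.strip,
      source: event[:client][:name],
      # Unknown statuses fall back to 'error' (an arbitrary choice here).
      severity: SEVERITY_BY_STATUS.fetch(check[:status], 'error')
    }
  end
  body
end

# Made-up sample event.
sample = {
  action: :create,
  client: { name: 'web01' },
  check: { name: 'check_cpu', status: 1, output: "CPU WARNING: 55%\n" }
}
puts JSON.pretty_generate(pagerduty_event(sample, 'ROUTING_KEY_PLACEHOLDER'))
```

Given the behaviour described above, re-sending the same dedup key with a "critical" severity would presumably raise the incident's urgency, while a later "warning" would not lower it again.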

joe-armstrong avatar Jan 25 '21 16:01 joe-armstrong