icinga2 icon indicating copy to clipboard operation
icinga2 copied to clipboard

Icinga DB: on every check result update state only 1x, not 3x in a row

Open Al2Klimov opened this issue 1 year ago • 2 comments

Before (time: vertical, stack: horizontal):

  • Checkable::ExecuteCheck
    • Checkable::UpdateNextCheck
      • IcingaDB::NextCheckChangedHandler
        • HSET icinga:host:state
        • HSET icinga:checksum:host:state
        • ZADD icinga:nextupdate:host
    • RandomCheckTask::ScriptFunc
      • Checkable::ProcessCheckResult
        • Checkable::UpdateNextCheck
          • IcingaDB::NextCheckChangedHandler
            • HSET icinga:host:state
            • HSET icinga:checksum:host:state
            • ZADD icinga:nextupdate:host
        • IcingaDB::NewCheckResultHandler
          • HSET icinga:host:state
          • HSET icinga:checksum:host:state
          • ZADD icinga:nextupdate:host
    • IcingaDB::StateChangeHandler
      • XADD icinga:runtime:state
      • IcingaDB::ForwardHistoryEntries
        • XADD icinga:history:stream:state

After:

  • Checkable::ExecuteCheck
    • Checkable::UpdateNextCheck
      • IcingaDB::NextCheckChangedHandler
    • RandomCheckTask::ScriptFunc
      • Checkable::ProcessCheckResult
        • Checkable::UpdateNextCheck
          • IcingaDB::NextCheckChangedHandler
        • IcingaDB::NewCheckResultHandler
          • HSET icinga:host:state
          • HSET icinga:checksum:host:state
          • ZADD icinga:nextupdate:host
    • IcingaDB::StateChangeHandler
      • XADD icinga:runtime:state
      • IcingaDB::ForwardHistoryEntries
        • XADD icinga:history:stream:state

The first state + nextupdate (for overdue) update comes from next_check being set to now + interval immediately before doing the actual check (not to trigger it twice). This update is not only not important for the end user, but even inappropriate. The end user SHALL see next_check being e.g. in -4s, not 5m, as the check is running at the moment.

The second one is just redundant as IcingaDB::NewCheckResultHandler (the third one) is called anyway and will update state + nextupdate as well.

fixes #9501

Awesome

1662388048.691010 [0 127.0.0.1:50579] "HSET" "icinga:host:state" "9b5bb23ddee640aeb6f2e71be4336655ae3bd036" "{\"check_attempt\":1,\"check_commandline\":\"'random'\",\"check_source\":\"alexandersmbp2.int.netways.de\",\"check_timeout\":60000,\"environment_id\":\"dd1de01a2760747f4e82bf7fec93477dcb13e09c\",\"execution_time\":2,\"hard_state\":1,\"host_id\":\"9b5bb23ddee640aeb6f2e71be4336655ae3bd036\",\"id\":\"9b5bb23ddee640aeb6f2e71be4336655ae3bd036\",\"in_downtime\":false,\"is_acknowledged\":0,\"is_active\":true,\"is_flapping\":false,\"is_handled\":false,\"is_problem\":true,\"is_reachable\":true,\"last_state_change\":1662388039211,\"last_update\":1662388048686,\"latency\":3,\"next_check\":1662388058157,\"next_update\":1662388068168,\"normalized_performance_data\":\"time=1662388048.686628 value=590 value_1m=531 value_5m=472 uptime=9.477065\",\"output\":\"Hello from alexandersmbp2.int.netways.de. Icinga 2 has been running for 9 seconds. Version: v2.13.0-425-g60eb41ce9\",\"performance_data\":\"time=1662388048.686628 value=590 value_1m=531 value_5m=472 uptime=9.477065\",\"previous_hard_state\":0,\"previous_soft_state\":0,\"scheduling_source\":\"alexandersmbp2.int.netways.de\",\"severity\":2112,\"soft_state\":1,\"state_type\":1}"
1662388048.691232 [0 127.0.0.1:50579] "HSET" "icinga:checksum:host:state" "9b5bb23ddee640aeb6f2e71be4336655ae3bd036" "{\"checksum\":\"5dcfed0c38d806df8f97353e24998fe5cefbe782\"}"
1662388048.691344 [0 127.0.0.1:50579] "ZADD" "icinga:nextupdate:host" "1662388068.168299" "9b5bb23ddee640aeb6f2e71be4336655ae3bd036"

Even more awesome

{"results":[{"code":200,"status":"Successfully rescheduled check for object '0'."}]}
1662388202.253110 [0 127.0.0.1:50579] "HSET" "icinga:host:state" "9b5bb23ddee640aeb6f2e71be4336655ae3bd036" "{\"check_attempt\":1,\"check_commandline\":\"'random'\",\"check_source\":\"alexandersmbp2.int.netways.de\",\"check_timeout\":60000,\"environment_id\":\"dd1de01a2760747f4e82bf7fec93477dcb13e09c\",\"execution_time\":1,\"hard_state\":1,\"host_id\":\"9b5bb23ddee640aeb6f2e71be4336655ae3bd036\",\"id\":\"9b5bb23ddee640aeb6f2e71be4336655ae3bd036\",\"in_downtime\":false,\"is_acknowledged\":0,\"is_active\":true,\"is_flapping\":false,\"is_handled\":false,\"is_problem\":true,\"is_reachable\":true,\"last_state_change\":1662388190821,\"last_update\":1662388200298,\"latency\":4,\"next_check\":2000000000000,\"next_update\":2000000010013,\"normalized_performance_data\":\"time=1662388200.298245 value=498 value_1m=448.200000 value_5m=398.400000 uptime=161.088683\",\"output\":\"Hello from alexandersmbp2.int.netways.de. Icinga 2 has been running for 2 minutes and 41 seconds. Version: v2.13.0-425-g60eb41ce9\",\"performance_data\":\"time=1662388200.298245 value=498 value_1m=448.200000 value_5m=398.400000 uptime=161.088683\",\"previous_hard_state\":0,\"previous_soft_state\":0,\"scheduling_source\":\"alexandersmbp2.int.netways.de\",\"severity\":2112,\"soft_state\":1,\"state_type\":1}"
1662388202.253286 [0 127.0.0.1:50579] "HSET" "icinga:checksum:host:state" "9b5bb23ddee640aeb6f2e71be4336655ae3bd036" "{\"checksum\":\"a9e5fe2332862ce9b5d077cc1b6100a443bffda1\"}"
1662388202.253410 [0 127.0.0.1:50579] "ZADD" "icinga:nextupdate:host" "2000000010.013584" "9b5bb23ddee640aeb6f2e71be4336655ae3bd036"

Al2Klimov avatar Sep 05 '22 14:09 Al2Klimov

Looks like the IDO had the same issue in the past (#3870) and the fix there was to introduce Checkable::OnNextCheckUpdated additional to Checkable::OnNextCheckChanged: b960850ce353aaf16a7f1da574b955817b8fd1bd

So just reusing that signal should also do the job for Icinga DB.

julianbrost avatar Sep 05 '22 15:09 julianbrost

🐘.

@lippserd You should promote me to the Chief Wheel Reinvention Officer.

Al2Klimov avatar Sep 05 '22 15:09 Al2Klimov