icinga2 icon indicating copy to clipboard operation
icinga2 copied to clipboard

Fix broken downtime comment sync

Open yhabteab opened this issue 1 year ago • 11 comments

All objects must be synced sorted by their load dependency. Otherwise, downtimes and/or comments might get synced before their respective Checkables, which will result in comments and downtimes being ignored by the other endpoint since it does not yet know about their checkables. Given that the runtime config updates event does not trigger a reload on the remote endpoint, these objects won't be synced again until the next reload.

~/master2/icinga2 (bundled-cluster-fixes ✗) ls prefix/var/lib/icinga2/api/packages/_api/*/conf.d/downtimes | wc -l
    3501
~/master2/icinga2 (bundled-cluster-fixes ✗) curl -sSku root:icinga 'https://localhost:5666/v1/objects/downtimes?pretty=1' | grep ' "attrs": {' | wc -l
    1501
~/master1/icinga2 (bundled-cluster-fixes ✗) curl -sSku root:icinga 'https://localhost:5665/v1/objects/downtimes?pretty=1' | grep ' "attrs": {' | wc -l
    3501

After master2 reload:

~/master2/icinga2 (bundled-cluster-fixes ✗) curl -sSku root:icinga 'https://localhost:5666/v1/objects/downtimes?pretty=1' | grep ' "attrs": {' | wc -l
    3501

closes #7786 closes #9873

TODO

  • [x] #10148
  • [x] rebase

yhabteab avatar Feb 12 '24 13:02 yhabteab

Quick question: would this also fix the following issue: Downtimes scheduled via API (sometimes) get synced/re-created with delay and are doubled #10078

log1-c avatar Jul 22 '24 09:07 log1-c

Quick question: would this also fix the following issue: Downtimes scheduled via API (sometimes) get synced/re-created with delay and are doubled #10078

It would be very helpful to get an answer to this.

Mordecaine avatar Aug 13 '24 07:08 Mordecaine

Quick question: would this also fix the following issue: Downtimes scheduled via API (sometimes) get synced/re-created with delay and are doubled #10078

It would be very helpful to get an answer to this.

Hi, we don't know for sure whether this will fix #10078 as we still haven't identified exactly what is going wrong there, other than something is not working as expected. It's unlikely that this PR will fix #10078, but we can't tell you for sure until the cause for #10078 is identified.

yhabteab avatar Aug 13 '24 08:08 yhabteab

@yhabteab When will this request be completed? Is there a timeline?

dgiesselbach avatar Aug 22 '24 10:08 dgiesselbach

I believe similar problems will still exist for other types where there's no load_after dependency at the moment: many objects can refer to a time period, however, there's not a single load_after TimePeriod. There are more such examples: Host/Service -> *Command, Notification -> NotificationCommand (not necessarily a complete list).

julianbrost avatar Sep 26 '24 13:09 julianbrost

A complete list of the navigable aka navigation types (attributes):

~/Workspace/icinga2 (broken-downtime-comment-sync ✗) grep -rE '\[.*navigation.*\]' lib
lib/icinga/dependency.ti:       [config, no_user_modify, required, navigation(child_host)] name(Host) child_host_name {
lib/icinga/dependency.ti:       [config, no_user_modify, navigation(child_service)] String child_service_name {
lib/icinga/dependency.ti:       [config, no_user_modify, required, navigation(parent_host)] name(Host) parent_host_name {
lib/icinga/dependency.ti:       [config, no_user_modify, navigation(parent_service)] String parent_service_name {
lib/icinga/dependency.ti:       [config, navigation] name(TimePeriod) period (PeriodRaw) {
lib/icinga/checkable.ti:        [config, required, navigation] name(CheckCommand) check_command (CheckCommandRaw) {
lib/icinga/checkable.ti:        [config, navigation] name(TimePeriod) check_period (CheckPeriodRaw) {
lib/icinga/checkable.ti:        [config, navigation] name(EventCommand) event_command (EventCommandRaw) {
lib/icinga/checkable.ti:        [config, navigation] name(Endpoint) command_endpoint (CommandEndpointRaw) {
lib/icinga/downtime.ti: [config, no_user_modify, required, navigation(host)] name(Host) host_name {
lib/icinga/downtime.ti: [config, no_user_modify, navigation(service)] String service_name {
lib/icinga/notification.ti:     [config, protected, required, navigation] name(NotificationCommand) command (CommandRaw) {
lib/icinga/notification.ti:     [config, navigation] name(TimePeriod) period (PeriodRaw) {
lib/icinga/notification.ti:     [config, no_user_modify, protected, required, navigation(host)] name(Host) host_name {
lib/icinga/notification.ti:     [config, protected, no_user_modify, navigation(service)] String service_name {
lib/icinga/notification.ti:     [config, navigation] name(Endpoint) command_endpoint (CommandEndpointRaw) {
lib/icinga/comment.ti:  [config, no_user_modify, protected, required, navigation(host)] name(Host) host_name {
lib/icinga/comment.ti:  [config, no_user_modify, protected, navigation(service)] String service_name {
lib/icinga/scheduleddowntime.ti:        [config, protected, no_user_modify, required, navigation(host)] name(Host) host_name {
lib/icinga/scheduleddowntime.ti:        [config, protected, no_user_modify, navigation(service)] String service_name {
lib/icinga/service.ti:  [no_storage, navigation] Host::Ptr host {
lib/icinga/user.ti:     [config, navigation] name(TimePeriod) period (PeriodRaw) {
lib/remote/zone.ti:     [config, no_user_modify, navigation] name(Zone) parent (ParentRaw) {

yhabteab avatar Sep 26 '24 13:09 yhabteab

A complete list of the navigable aka navigation types (attributes):

Another list of non-navigable dependencies :):

~/Workspace/icinga2 (broken-downtime-comment-sync ✗) grep -rE 'array\(name.*' lib     
lib/icinga/host.ti:     [config, no_user_modify, required, signal_with_old_value] array(name(HostGroup)) groups {
lib/icinga/notification.ti:     [config, signal_with_old_value] array(name(User)) users (UsersRaw);
lib/icinga/notification.ti:     [config, signal_with_old_value] array(name(UserGroup)) user_groups (UserGroupsRaw);
lib/icinga/servicegroup.ti:     [config, no_user_modify] array(name(ServiceGroup)) groups;
lib/icinga/hostgroup.ti:        [config, no_user_modify] array(name(HostGroup)) groups;
lib/icinga/usergroup.ti:        [config, no_user_modify] array(name(UserGroup)) groups;
lib/icinga/service.ti:  [config, no_user_modify, required, signal_with_old_value] array(name(ServiceGroup)) groups {
lib/icinga/user.ti:     [config, no_user_modify, required, signal_with_old_value] array(name(UserGroup)) groups {
lib/icinga/timeperiod.ti:       [config, required, signal_with_old_value] array(name(TimePeriod)) excludes {
lib/icinga/timeperiod.ti:       [config, required, signal_with_old_value] array(name(TimePeriod)) includes {
lib/remote/zone.ti:     [config] array(name(Endpoint)) endpoints (EndpointsRaw);

yhabteab avatar Sep 26 '24 15:09 yhabteab

While cross-checking[^1] the dependencies added by this PR, I both noticed another problem as well a different idea to fix this in a hopefully more reliable way.

The problem: There are object types that allow referencing other types of the same type. The first example of this I found is HostGroup: it has a groups attribute where other host groups can be referenced to recursively define groups: https://github.com/Icinga/icinga2/blob/5e9e0bbcdf8af79e367ab5b2296a0a13d887ff10/lib/icinga/hostgroup.ti#L22

Something very similar also exists for TimePeriod with the include/exclude attributes: https://github.com/Icinga/icinga2/blob/5e9e0bbcdf8af79e367ab5b2296a0a13d887ff10/lib/icinga/timeperiod.ti#L27-L32

So no matter how we reorder the types, the issue that the order of individual objects within a type can also make a difference, i.e. the included/excluded time periods have to be sent, otherwise the same problem can happen there as well (and then continue to affect other objects that reference these time periods like hosts and services).

However, on a positive note, it looks like there's actually an easy solution to this: Icinga 2 already tracks these dependencies between individual config objects on a per-object basis (https://github.com/Icinga/icinga2/blob/master/lib/base/dependencygraph.hpp, https://github.com/Icinga/icinga2/blob/master/lib/base/dependencygraph.cpp). Thus, we should be able to solve both the problem that was the reason for starting this PR as well as the other one I just mentioned by sending the config objects in a topological sort order in respect to DependencyGraph (that should actually remove the requirement to iterate the types in a particular order at all).

[^1]: These two commands were my friends (just so that they are not only forgotten in my shell history): git grep -E 'name\([A-Za-z]+\)' -- '*.ti' and git grep load_after -- '*.ti'

julianbrost avatar Oct 08 '24 14:10 julianbrost

Thus, we should be able to solve both the problem that was the reason for starting this PR as well as the other one I just mentioned by sending the config objects in a topological sort order in respect to DependencyGraph

No, it won't!

I actually did notice this, but didn't want to go into it too much since it's a relationship on a per object basis and not on the actual types themselves. Icinga 2 happily accepts such a config, and none of the proposed alternatives is going to overcome it.

object HostGroup "foo" {
    groups = [ "bar" ]
}

object HostGroup "bar" {
    groups = [ "foo" ]
}
{
    "attrs": {
        "__name": "bar",
        "display_name": "bar",
        "groups": [
            "foo"
        ],
        "name": "bar",
        "type": "HostGroup",
    },
    "name": "bar",
    "type": "HostGroup"
},
{
    "attrs": {
        "__name": "foo",
        "display_name": "foo",
        "groups": [
          "bar"
        ],
        "name": "foo",
        "type": "HostGroup",
    },
    "name": "foo",
    "type": "HostGroup"
}

yhabteab avatar Oct 14 '24 12:10 yhabteab

And this config!

object TimePeriod "included" {
  excludes = ["excluded"]
  ranges = { "2024-10-14" = "14:00-15:00" }
}

object TimePeriod "excluded" {
  excludes = ["included"]
  ranges = { "2024-10-14" = "13:00-15:00" }
}

There is absolutely nothing in the code that prevents two objects from becoming dependent on each other.

yhabteab avatar Oct 14 '24 12:10 yhabteab

Thus, we should be able to solve both the problem that was the reason for starting this PR as well as the other one I just mentioned by sending the config objects in a topological sort order in respect to DependencyGraph

No, it won't!

Indeed, it would not make every (currently) possible situation work, but at least for all objects where it's possible.

object HostGroup "foo" {
    groups = [ "bar" ]
}

object HostGroup "bar" {
    groups = [ "foo" ]
}
object TimePeriod "included" {
  excludes = ["excluded"]
  ranges = { "2024-10-14" = "14:00-15:00" }
}

object TimePeriod "excluded" {
  excludes = ["included"]
  ranges = { "2024-10-14" = "13:00-15:00" }
}

With these objects, it's simply impossible to order them such that each object is synced before all of its dependencies. And such config being accepted looks like a bug to me.

Note that this ordering is only relevant for runtime-created objects, so defining something like this in a config file wouldn't even be a problem here. However, you can also create the dependency cycle by modifying attributes at runtime (if allows, for nested host/service groups it isn't, the groups attribute is actually marked no_user_modify, but with the time period example, it's actually possible at the moment).

julianbrost avatar Oct 14 '24 12:10 julianbrost

Fortunately, @julianbrost found out during testing that the objects are not being synchronised in the correct order, as they are supposed to be with this PR. Thus, I made some modifications in the DependencyGraph class to achieve the desired result without producing too many code changes.

Before

[2024-12-04 08:45:47 +0100] critical/config: Error: Validation failed for object 'dsl-groups' of type 'HostGroup'; Attribute 'groups': Object 'dns-groups' of type 'HostGroup' does not exist.
Location: in /Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf: 2:2-2:26
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(1): object HostGroup "dsl-groups" {
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(2):  groups = [ "dns-groups" ]
                                                                                                                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(3):  version = 1733298340.337465
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(4):  zone = "master"
[2024-12-04 08:45:47 +0100] critical/config: 1 error
[2024-12-04 08:45:47 +0100] critical/ApiListener: Could not create object 'dsl-groups':
[2024-12-04 08:45:47 +0100] critical/ApiListener: Error: Validation failed for object 'dsl-groups' of type 'HostGroup'; Attribute 'groups': Object 'dns-groups' of type 'HostGroup' does not exist.
Location: in /Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf: 2:2-2:26
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(1): object HostGroup "dsl-groups" {
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(2):  groups = [ "dns-groups" ]
                                                                                                                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(3):  version = 1733298340.337465
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(4):  zone = "master"
[2024-12-04 08:45:47 +0100] information/ConfigObjectUtility: Created and activated object 'dns-groups' of type 'HostGroup'.

After

[2024-12-04 08:51:01 +0100] information/ApiListener: Copying file 'master//_etc/overflow.conf' from config sync staging to production zones directory.
[2024-12-04 08:51:01 +0100] information/ApiListener: Copying file 'master//_etc/services.conf' from config sync staging to production zones directory.
[2024-12-04 08:51:02 +0100] information/ConfigObjectUtility: Created and activated object 'dns-groups' of type 'HostGroup'.
[2024-12-04 08:51:02 +0100] information/ConfigObjectUtility: Created and activated object 'dsl-groups' of type 'HostGroup'.
[2024-12-04 08:51:03 +0100] information/Application: Got reload command: Starting new instance.

yhabteab avatar Dec 04 '24 07:12 yhabteab

why we shouldn't simply fix the terminology instead.

  • Because we already did it: #10263

Al2Klimov avatar Dec 11 '24 16:12 Al2Klimov