icinga2
icinga2 copied to clipboard
Fix broken downtime comment sync
All objects must be synced sorted by their load dependency. Otherwise, downtimes and/or comments might get synced before their respective Checkables, which will result in comments and downtimes being ignored by the other endpoint since it does not yet know about their checkables. Given that the runtime config updates event does not trigger a reload on the remote endpoint, these objects won't be synced again until the next reload.
~/master2/icinga2 (bundled-cluster-fixes ✗) ls prefix/var/lib/icinga2/api/packages/_api/*/conf.d/downtimes | wc -l
3501
~/master2/icinga2 (bundled-cluster-fixes ✗) curl -sSku root:icinga 'https://localhost:5666/v1/objects/downtimes?pretty=1' | grep ' "attrs": {' | wc -l
1501
~/master1/icinga2 (bundled-cluster-fixes ✗) curl -sSku root:icinga 'https://localhost:5665/v1/objects/downtimes?pretty=1' | grep ' "attrs": {' | wc -l
3501
After master2 reload:
~/master2/icinga2 (bundled-cluster-fixes ✗) curl -sSku root:icinga 'https://localhost:5666/v1/objects/downtimes?pretty=1' | grep ' "attrs": {' | wc -l
3501
closes #7786 closes #9873
TODO
- [x] #10148
- [x] rebase
Quick question: would this also fix the following issue: Downtimes scheduled via API (sometimes) get synced/re-created with delay and are doubled #10078
Quick question: would this also fix the following issue: Downtimes scheduled via API (sometimes) get synced/re-created with delay and are doubled #10078
It would be very helpful to get an answer to this.
Quick question: would this also fix the following issue: Downtimes scheduled via API (sometimes) get synced/re-created with delay and are doubled #10078
It would be very helpful to get an answer to this.
Hi, we don't know for sure whether this will fix #10078 as we still haven't identified exactly what is going wrong there, other than something is not working as expected. It's unlikely that this PR will fix #10078, but we can't tell you for sure until the cause for #10078 is identified.
@yhabteab When will this request be completed? Is there a timeline?
I believe similar problems will still exist for other types where there's no load_after dependency at the moment: many objects can refer to a time period, however, there's not a single load_after TimePeriod. There are more such examples: Host/Service -> *Command, Notification -> NotificationCommand (not necessarily a complete list).
A complete list of the navigable aka navigation types (attributes):
~/Workspace/icinga2 (broken-downtime-comment-sync ✗) grep -rE '\[.*navigation.*\]' lib
lib/icinga/dependency.ti: [config, no_user_modify, required, navigation(child_host)] name(Host) child_host_name {
lib/icinga/dependency.ti: [config, no_user_modify, navigation(child_service)] String child_service_name {
lib/icinga/dependency.ti: [config, no_user_modify, required, navigation(parent_host)] name(Host) parent_host_name {
lib/icinga/dependency.ti: [config, no_user_modify, navigation(parent_service)] String parent_service_name {
lib/icinga/dependency.ti: [config, navigation] name(TimePeriod) period (PeriodRaw) {
lib/icinga/checkable.ti: [config, required, navigation] name(CheckCommand) check_command (CheckCommandRaw) {
lib/icinga/checkable.ti: [config, navigation] name(TimePeriod) check_period (CheckPeriodRaw) {
lib/icinga/checkable.ti: [config, navigation] name(EventCommand) event_command (EventCommandRaw) {
lib/icinga/checkable.ti: [config, navigation] name(Endpoint) command_endpoint (CommandEndpointRaw) {
lib/icinga/downtime.ti: [config, no_user_modify, required, navigation(host)] name(Host) host_name {
lib/icinga/downtime.ti: [config, no_user_modify, navigation(service)] String service_name {
lib/icinga/notification.ti: [config, protected, required, navigation] name(NotificationCommand) command (CommandRaw) {
lib/icinga/notification.ti: [config, navigation] name(TimePeriod) period (PeriodRaw) {
lib/icinga/notification.ti: [config, no_user_modify, protected, required, navigation(host)] name(Host) host_name {
lib/icinga/notification.ti: [config, protected, no_user_modify, navigation(service)] String service_name {
lib/icinga/notification.ti: [config, navigation] name(Endpoint) command_endpoint (CommandEndpointRaw) {
lib/icinga/comment.ti: [config, no_user_modify, protected, required, navigation(host)] name(Host) host_name {
lib/icinga/comment.ti: [config, no_user_modify, protected, navigation(service)] String service_name {
lib/icinga/scheduleddowntime.ti: [config, protected, no_user_modify, required, navigation(host)] name(Host) host_name {
lib/icinga/scheduleddowntime.ti: [config, protected, no_user_modify, navigation(service)] String service_name {
lib/icinga/service.ti: [no_storage, navigation] Host::Ptr host {
lib/icinga/user.ti: [config, navigation] name(TimePeriod) period (PeriodRaw) {
lib/remote/zone.ti: [config, no_user_modify, navigation] name(Zone) parent (ParentRaw) {
A complete list of the navigable aka
navigationtypes (attributes):
Another list of non-navigable dependencies :):
~/Workspace/icinga2 (broken-downtime-comment-sync ✗) grep -rE 'array\(name.*' lib
lib/icinga/host.ti: [config, no_user_modify, required, signal_with_old_value] array(name(HostGroup)) groups {
lib/icinga/notification.ti: [config, signal_with_old_value] array(name(User)) users (UsersRaw);
lib/icinga/notification.ti: [config, signal_with_old_value] array(name(UserGroup)) user_groups (UserGroupsRaw);
lib/icinga/servicegroup.ti: [config, no_user_modify] array(name(ServiceGroup)) groups;
lib/icinga/hostgroup.ti: [config, no_user_modify] array(name(HostGroup)) groups;
lib/icinga/usergroup.ti: [config, no_user_modify] array(name(UserGroup)) groups;
lib/icinga/service.ti: [config, no_user_modify, required, signal_with_old_value] array(name(ServiceGroup)) groups {
lib/icinga/user.ti: [config, no_user_modify, required, signal_with_old_value] array(name(UserGroup)) groups {
lib/icinga/timeperiod.ti: [config, required, signal_with_old_value] array(name(TimePeriod)) excludes {
lib/icinga/timeperiod.ti: [config, required, signal_with_old_value] array(name(TimePeriod)) includes {
lib/remote/zone.ti: [config] array(name(Endpoint)) endpoints (EndpointsRaw);
While cross-checking[^1] the dependencies added by this PR, I both noticed another problem as well a different idea to fix this in a hopefully more reliable way.
The problem: There are object types that allow referencing other types of the same type. The first example of this I found is HostGroup: it has a groups attribute where other host groups can be referenced to recursively define groups:
https://github.com/Icinga/icinga2/blob/5e9e0bbcdf8af79e367ab5b2296a0a13d887ff10/lib/icinga/hostgroup.ti#L22
Something very similar also exists for TimePeriod with the include/exclude attributes:
https://github.com/Icinga/icinga2/blob/5e9e0bbcdf8af79e367ab5b2296a0a13d887ff10/lib/icinga/timeperiod.ti#L27-L32
So no matter how we reorder the types, the issue that the order of individual objects within a type can also make a difference, i.e. the included/excluded time periods have to be sent, otherwise the same problem can happen there as well (and then continue to affect other objects that reference these time periods like hosts and services).
However, on a positive note, it looks like there's actually an easy solution to this: Icinga 2 already tracks these dependencies between individual config objects on a per-object basis (https://github.com/Icinga/icinga2/blob/master/lib/base/dependencygraph.hpp, https://github.com/Icinga/icinga2/blob/master/lib/base/dependencygraph.cpp). Thus, we should be able to solve both the problem that was the reason for starting this PR as well as the other one I just mentioned by sending the config objects in a topological sort order in respect to DependencyGraph (that should actually remove the requirement to iterate the types in a particular order at all).
[^1]: These two commands were my friends (just so that they are not only forgotten in my shell history): git grep -E 'name\([A-Za-z]+\)' -- '*.ti' and git grep load_after -- '*.ti'
Thus, we should be able to solve both the problem that was the reason for starting this PR as well as the other one I just mentioned by sending the config objects in a topological sort order in respect to
DependencyGraph
No, it won't!
I actually did notice this, but didn't want to go into it too much since it's a relationship on a per object basis and not on the actual types themselves. Icinga 2 happily accepts such a config, and none of the proposed alternatives is going to overcome it.
object HostGroup "foo" {
groups = [ "bar" ]
}
object HostGroup "bar" {
groups = [ "foo" ]
}
{
"attrs": {
"__name": "bar",
"display_name": "bar",
"groups": [
"foo"
],
"name": "bar",
"type": "HostGroup",
},
"name": "bar",
"type": "HostGroup"
},
{
"attrs": {
"__name": "foo",
"display_name": "foo",
"groups": [
"bar"
],
"name": "foo",
"type": "HostGroup",
},
"name": "foo",
"type": "HostGroup"
}
And this config!
object TimePeriod "included" {
excludes = ["excluded"]
ranges = { "2024-10-14" = "14:00-15:00" }
}
object TimePeriod "excluded" {
excludes = ["included"]
ranges = { "2024-10-14" = "13:00-15:00" }
}
There is absolutely nothing in the code that prevents two objects from becoming dependent on each other.
Thus, we should be able to solve both the problem that was the reason for starting this PR as well as the other one I just mentioned by sending the config objects in a topological sort order in respect to
DependencyGraphNo, it won't!
Indeed, it would not make every (currently) possible situation work, but at least for all objects where it's possible.
object HostGroup "foo" { groups = [ "bar" ] } object HostGroup "bar" { groups = [ "foo" ] }
object TimePeriod "included" { excludes = ["excluded"] ranges = { "2024-10-14" = "14:00-15:00" } } object TimePeriod "excluded" { excludes = ["included"] ranges = { "2024-10-14" = "13:00-15:00" } }
With these objects, it's simply impossible to order them such that each object is synced before all of its dependencies. And such config being accepted looks like a bug to me.
Note that this ordering is only relevant for runtime-created objects, so defining something like this in a config file wouldn't even be a problem here. However, you can also create the dependency cycle by modifying attributes at runtime (if allows, for nested host/service groups it isn't, the groups attribute is actually marked no_user_modify, but with the time period example, it's actually possible at the moment).
Fortunately, @julianbrost found out during testing that the objects are not being synchronised in the correct order, as they are supposed to be with this PR. Thus, I made some modifications in the DependencyGraph class to achieve the desired result without producing too many code changes.
Before
[2024-12-04 08:45:47 +0100] critical/config: Error: Validation failed for object 'dsl-groups' of type 'HostGroup'; Attribute 'groups': Object 'dns-groups' of type 'HostGroup' does not exist.
Location: in /Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf: 2:2-2:26
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(1): object HostGroup "dsl-groups" {
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(2): groups = [ "dns-groups" ]
^^^^^^^^^^^^^^^^^^^^^^^^^
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(3): version = 1733298340.337465
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(4): zone = "master"
[2024-12-04 08:45:47 +0100] critical/config: 1 error
[2024-12-04 08:45:47 +0100] critical/ApiListener: Could not create object 'dsl-groups':
[2024-12-04 08:45:47 +0100] critical/ApiListener: Error: Validation failed for object 'dsl-groups' of type 'HostGroup'; Attribute 'groups': Object 'dns-groups' of type 'HostGroup' does not exist.
Location: in /Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf: 2:2-2:26
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(1): object HostGroup "dsl-groups" {
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(2): groups = [ "dns-groups" ]
^^^^^^^^^^^^^^^^^^^^^^^^^
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(3): version = 1733298340.337465
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(4): zone = "master"
[2024-12-04 08:45:47 +0100] information/ConfigObjectUtility: Created and activated object 'dns-groups' of type 'HostGroup'.
After
[2024-12-04 08:51:01 +0100] information/ApiListener: Copying file 'master//_etc/overflow.conf' from config sync staging to production zones directory.
[2024-12-04 08:51:01 +0100] information/ApiListener: Copying file 'master//_etc/services.conf' from config sync staging to production zones directory.
[2024-12-04 08:51:02 +0100] information/ConfigObjectUtility: Created and activated object 'dns-groups' of type 'HostGroup'.
[2024-12-04 08:51:02 +0100] information/ConfigObjectUtility: Created and activated object 'dsl-groups' of type 'HostGroup'.
[2024-12-04 08:51:03 +0100] information/Application: Got reload command: Starting new instance.
why we shouldn't simply fix the terminology instead.
- Because we already did it: #10263