
Bugfix: configobject shutdown order

w1ll-i-code opened this pull request on Oct 17 '24

This PR fixes #10179.

Shut down the objects in the correct order.

During a restart, the objects are shut down in the reverse of the order they should be. As a result, the CheckerComponent is still updating objects after the IDO, Icinga DB, notifications, etc. have already been shut down, leaving icinga2 out of sync with external components.

Increase activation_priority for key config objects to ensure correct startup and shutdown order

Make sure all config objects are started and shut down in the right order, so that each object has its dependencies loaded before it starts executing.
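For illustration only, here is a minimal C++ sketch of the ordering this description argues for. The type names are taken from this thread, but the priority values and the sort logic are made up for the example and are not the actual icinga2 code: objects start in ascending activation priority and stop in the reverse order, so the checker goes quiet before the writers that consume its results.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Toy stand-in for a config object type; the priorities below are illustrative.
struct ConfigObject {
    std::string name;
    int activationPriority;
};

int main() {
    std::vector<ConfigObject> objects{
        {"IcingaDB", 100},
        {"IdoMysqlConnection", 100},
        {"NotificationComponent", 100},
        {"CheckerComponent", 300},
    };

    // Startup: lowest activation priority first, so dependencies are up
    // before the checker starts producing results.
    std::stable_sort(objects.begin(), objects.end(),
        [](const ConfigObject& a, const ConfigObject& b) {
            return a.activationPriority < b.activationPriority;
        });
    for (const auto& o : objects)
        std::cout << "start " << o.name << '\n';

    // Shutdown: walk the same list in reverse, so the checker stops before
    // the IDO / Icinga DB / notification objects that consume its results.
    for (auto it = objects.rbegin(); it != objects.rend(); ++it)
        std::cout << "stop  " << it->name << '\n';
}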

w1ll-i-code · Oct 17 '24

Thank you for your pull request. Before we can look at it, you'll need to sign a Contributor License Agreement (CLA).

Please follow instructions at https://icinga.com/company/contributor-agreement to sign the CLA.

After that, please reply here with a comment and we'll verify.

Contributors that have not signed yet: @w1ll-i-code

  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Please contact us if you think this is the case.

  • If you signed the CLA as a corporation, your GitHub username may not have been submitted to us. Please reach out to the responsible person in your organization.

cla-bot[bot] · Oct 17 '24

How do I retrigger the CLA check?

w1ll-i-code · Oct 17 '24

@cla-bot check

bobapple · Oct 17 '24

Hi @w1ll-i-code,

Thanks for the PR. We will look at the changes as soon as possible.

Best regards, Eric

lippserd · Oct 22 '24

Hi, thank you for your contribution!

As a result, the CheckerComponent is still updating objects after the IDO, Icinga DB, notifications, etc. have already been shut down, leaving icinga2 out of sync with external components.

Can you confirm that you have tested this PR and that it resolves the problem described in the referenced issue? Did you notice that the checker is also subscribed to the checkable events? That means this change shouldn't provide any benefit, since the checker will drop the checkables from its queue if they are stopped first.

The primary issue here is not the deactivation order of the objects, but rather how they are distributed in a cluster setup. To illustrate, while a host object is active on one node, its associated notification object can become active on another node. Consequently, each time the host object encounters a problem state, this node must sync the information to the other node to trigger a notification. However, such cluster events can disappear mysteriously if the server is restarted or reloaded at the wrong time. There is already an issue (#10157) and a draft PR (#10161) to address this.

yhabteab · Oct 22 '24

Yes, I can confirm that this PR did fix the issues we were experiencing, at least in the specific tests I ran, which are attached to the original issue. I was hoping for a sooner response, but the fix is now running in production, and if anything is still missing I am sure I'll get feedback. With the setup I provided for reproduction, I could observe several lost entries in the IDO on every restart, a problem which no longer occurred once I changed the shutdown order. Knowing that, I also took the liberty of rearranging some other objects in the startup/shutdown order to match my findings.

Did you notice that the checker is also subscribed to the checkable events?

I have to admit, I am usually not a C++ developer and had no prior experience with this particular part of the icinga2 code base, so I did, in fact, not know that. However, I can tell you that this particular fix did resolve the issue in the tests I performed...

The primary issue here is not the deactivation order of the objects, but rather how they are distributed in a cluster setup.

I don't think that's the case, as neither I, when reproducing the issue, nor the client who first reported it, used a cluster setup for icinga2. You can have a look at the docker-compose file I attached to the issue to reproduce the bug and confirm this for yourself. The behaviour can be observed on a single icinga2 instance running on its own, without any other cluster nodes or even satellites.

w1ll-i-code · Oct 23 '24

Yes, I can confirm that this PR did fix the issues we were experiencing, at least in the specific tests I ran, which are attached to the original issue.

If this PR indeed fixes your specific issue, it is likely not due to the deactivation priority of the checker or the API listener, but rather to the new priority order of the DbConnection class and the notification component. As I already said, if the checkable is stopped first, the checker will immediately drop it from its queues. Likewise, the change for the API listener or the endpoint shouldn't be necessary at all, as you don't want to accept new connections while you're in the shutdown phase.
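To make the queue-dropping behaviour described above concrete, here is a simplified, hedged sketch. It is not the real CheckerComponent; the class and member names are invented for the example. A checkable that was stopped before the checker reaches it is simply discarded instead of producing a check result.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Toy checkable: 'active' mirrors whether the config object is still started.
struct Checkable {
    std::string name;
    bool active = true;
};

class Checker {
public:
    void Enqueue(std::shared_ptr<Checkable> checkable) {
        m_Queue.push_back(std::move(checkable));
    }

    void RunPendingChecks() {
        for (const auto& checkable : m_Queue) {
            if (!checkable->active) {
                // Object was deactivated first: drop it, no check result is produced.
                std::cout << "drop  " << checkable->name << '\n';
                continue;
            }
            std::cout << "check " << checkable->name << '\n';
        }
        m_Queue.clear();
    }

private:
    std::vector<std::shared_ptr<Checkable>> m_Queue;
};

int main() {
    Checker checker;
    auto web = std::make_shared<Checkable>();
    web->name = "web";
    auto db = std::make_shared<Checkable>();
    db->name = "db";
    checker.Enqueue(web);
    checker.Enqueue(db);
    db->active = false; // stopped before the checker got to it
    checker.RunPendingChecks();
}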

Once the rust program has found some missing state_history and/or notification, you can verify that by looking at the service history.

Anyway, can you please explain in #10179 what exactly you are missing in that screenshot of Icinga DB Web (https://github.com/Icinga/icinga2/issues/10179#issuecomment-2411081374)? As far as I can see, there is nothing wrong with that service history and I actually did try to reproduce the issue on my end as you described, but your Rust script just keeps crashing on startup:

[15823/144776] Analyzing state changes for host-icinga-lost-statehistory-00946!service-icinga-lost-statechange-05
[15824/144776] Analyzing state changes for icinga2!disk
thread 'main' panicked at src/icingadb.rs:133:75:
called `Option::unwrap()` on a `None` value

But at least it claimed to have found an error for host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04:

[217/144776] Analyzing state changes for host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04
Idk what's going on host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04: ObjectNotifications { service_id: "004D960037D7EAE178416B5CD110F13F28FD1542", hard_state_changes: 1, notification_count: 2 }
[Screenshot: Icinga DB Web, 2024-10-23 12:48]

As you can see in Icinga DB Web, this is not an error: the first notification is triggered by the state change from PENDING to WARNING, and the second one is a reminder notification. Since you did not set notification.interval to zero, you are re-notified after exactly 30m (the interval of 1800 seconds in the output below), which is the case here.

{
    "results": [
        {
            "attrs": {
                "__name": "host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04!apply-nt-log-notifications",
                "active": true,
                "command": "nc-log-notification",
                "ha_mode": 0,
                "host_name": "host-icinga-lost-statehistory-00868",
                "interval": 1800,
...

yhabteab · Oct 23 '24

If this PR indeed fixes your specific issue, it is likely not due to the deactivation priority of the checker or the API listener, but rather to the new priority order of the DbConnection class and the notification component.

No, it is not. I made those changes only after I confirmed it works with the reversed shutdown order...

Anyway, can you please explain in https://github.com/Icinga/icinga2/issues/10179 what exactly you are missing in that screenshot of Icinga DB Web (https://github.com/Icinga/icinga2/issues/10179#issuecomment-2411081374)?

I'm sorry, I should have clarified more. From the bottom up, you see a hard state change from OK to WARNING and the notification that was sent. After that, you see a state change from hard OK to soft WARNING, meaning the service previously had a hard state change back to OK, and both that state change and the corresponding notification are missing from the history.

I actually did try to reproduce the issue on my end as you described, but your Rust script just keeps crashing on startup

I had removed those services manually as I did not need them. I did not think that it would break my script.

since you did not set the notification.interval to zero, you will be re-notified after exactly 30m, which is the case here.

You are right, I only changed the notification interval on the test machine but forgot to do the same locally before sending the basket. I apologize for that...

w1ll-i-code · Oct 23 '24

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: William Calliari. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] · Nov 11 '24

@Al2Klimov Thanks for the hints. Now everything should be in order.

w1ll-i-code · Nov 11 '24

Somewhere down the line while applying your suggested improvements, the fix broke again. I'll try to reproduce exactly what went wrong.

w1ll-i-code · Nov 13 '24

Hi @yhabteab how did the meeting yesterday go? Did you come to any resolution regarding this issue?

w1ll-i-code · Nov 27 '24

Hi @yhabteab how did the meeting yesterday go? Did you come to any resolution regarding this issue?

Hi @w1ll-i-code, I'm afraid that changing the order of activation_priority won't fully solve the problem, or, if it partially does, it will break the existing activation_priority on startup. I'm really sorry for the inconvenience, but I think the underlying problem is much deeper than just changing the activation and deactivation order of objects. It's going to take a lot more time and testing to fix it properly. I don't want to say that this PR is irrelevant, but even with this PR, remotely generated events will still get lost, so we'd love to put in more effort and provide a fix in addition to this PR. However, as we are currently fully occupied with preparations for the next Icinga 2 bugfix release, which unfortunately does not include this, we'll give you an update in the next 2-3 weeks, once Icinga 2.14.4 is ready.

yhabteab · Nov 27 '24

@yhabteab Thanks for the quick reply. I still have some questions:

  1. What exactly do you mean by "remotely generated events will still get lost"? Messages between two master nodes? Or are you talking about CheckResults coming in over the ApiListener?
  2. What does "it will break the existing activation_priority on startup" mean exactly? What isn't working anymore afterwards?

w1ll-i-code · Nov 27 '24

  1. What exactly do you mean by "remotely generated events will still get lost"? Messages between two master nodes? Or are you talking about CheckResults coming in over the ApiListener?

Yes, I'm referring to check result events coming from other endpoints.

  2. What does "it will break the existing activation_priority on startup" mean exactly? What isn't working anymore afterwards?

It's not exactly that something stops working, but so far IcingaDB is activated after hosts, services, downtimes etc. (originally I thought it was actually the other way round, i.e. IcingaDB is activated first and then the other objects; that's my mistake). With this PR, however, its activation priority has been set to -50, which means it will definitely be activated before anything else and ~~probably run into problems like in https://github.com/Icinga/icinga2/pull/10151 more prominently~~ (edit: NVM! That's irrelevant here, but I'll have to test it out properly, which I don't have much time for at the moment; I'll give you an update in the coming weeks).

yhabteab · Nov 28 '24

Yes, I'm referring to check result events coming from other endpoints.

@yhabteab The ApiListener should already be shut down at that point (activation_priority 50), right? So it wouldn't impact it?

w1ll-i-code · Nov 28 '24

Unfortunately, stopping the ApiListener object does not stop the listener from accepting new connections, nor does it immediately disconnect the already connected endpoints. All active connections are only terminated when the process is terminated, not after the API listener has been stopped. @julianbrost was/is working on fixing this, i.e. making the listener, when stopped, manually disconnect all connected endpoints and reject any new connection attempts, but as mentioned before, this will require extensive testing and more time, and will therefore not make it into the next bugfix release.
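As a hedged sketch of the behaviour such a fix would aim for (invented class names, no real networking, and explicitly not @julianbrost's implementation): stopping the listener should both reject new connection attempts and actively disconnect the endpoints that are already connected.

#include <atomic>
#include <memory>
#include <mutex>
#include <vector>

// Stand-in for a connected endpoint; real code would close the underlying socket.
struct Connection {
    std::atomic<bool> open{true};
    void Disconnect() { open = false; }
};

class Listener {
public:
    // Accept a new connection only while the listener is still running.
    bool Accept(std::shared_ptr<Connection> conn) {
        if (!m_Accepting)
            return false; // listener stopped: reject new connection attempts
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Connections.push_back(std::move(conn));
        return true;
    }

    // Stop accepting and tear down every existing connection.
    void Stop() {
        m_Accepting = false;
        std::lock_guard<std::mutex> lock(m_Mutex);
        for (const auto& conn : m_Connections)
            conn->Disconnect();
        m_Connections.clear();
    }

private:
    std::atomic<bool> m_Accepting{true};
    std::mutex m_Mutex;
    std::vector<std::shared_ptr<Connection>> m_Connections;
};

int main() {
    Listener listener;
    auto first = std::make_shared<Connection>();
    listener.Accept(first);  // accepted while running
    listener.Stop();         // disconnects 'first' and blocks anything new
    auto second = std::make_shared<Connection>();
    return listener.Accept(second) ? 1 : 0; // returns 0: 'second' was rejected
}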

yhabteab · Nov 29 '24

@yhabteab This should actually not be a problem anymore, as the object is already deactivated at that point (priority 0), meaning it wouldn't change the internal state or dispatch events either way....

w1ll-i-code · Mar 10 '25

Hi @w1ll-i-code, I'm really sorry for the long delay again!

As I said in https://github.com/Icinga/icinga2/pull/10191#discussion_r1856919768, we need to fix the issue you are encountering without having to change the current activation priority of the types. However, to achieve this, we need to adjust the class parser a bit so that a configuration type can provide both activation_priority and deactivation_priority at the same time if needed. Since I'm not sure whether you are familiar with the parser, I just built support for it and pushed 60bcdc2. Can you please cherry-pick this commit to your branch, revert all the changes you made to the activation_priority, use deactivation_priority instead, and see if that fixes your problem? With my commit applied, e.g. Icinga DB could then define both as follows:

activation_priority 100;
deactivation_priority -50;

And don't forget to use GetDeactivationPriority() here instead: https://github.com/Icinga/icinga2/blob/3083a32bc638beea81c8013877f3c870204ff16d/lib/base/configobject.cpp#L589

EDIT: I have just pushed another commit, please use this f118110 instead of 60bcdc2.
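For illustration, here is a hedged sketch of the idea behind separate priorities. It uses plain structs with invented values (only the Icinga DB pair 100 / -50 is quoted from above) and is not the actual class parser or configobject.cpp code: startup keeps sorting by activation priority, shutdown sorts by deactivation priority and, as an illustrative assumption, walks that list in reverse so that Icinga DB with -50 is stopped last, after the checker has gone quiet.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Toy type description; priorities other than the Icinga DB pair are made up.
struct TypeInfo {
    std::string name;
    int activationPriority;
    int deactivationPriority;
};

int main() {
    std::vector<TypeInfo> types{
        {"Host",             0,    0},
        {"IcingaDB",         100, -50},
        {"CheckerComponent", 300,  300},
    };

    // Startup keeps using the activation priority, lowest first,
    // so the existing startup order is not touched.
    auto startOrder = types;
    std::stable_sort(startOrder.begin(), startOrder.end(),
        [](const TypeInfo& a, const TypeInfo& b) {
            return a.activationPriority < b.activationPriority;
        });
    for (const auto& t : startOrder)
        std::cout << "start " << t.name << '\n';

    // Shutdown sorts by the deactivation priority instead and walks the list
    // in reverse (illustrative assumption), so Icinga DB (-50) is stopped last.
    auto stopOrder = types;
    std::stable_sort(stopOrder.begin(), stopOrder.end(),
        [](const TypeInfo& a, const TypeInfo& b) {
            return a.deactivationPriority < b.deactivationPriority;
        });
    for (auto it = stopOrder.rbegin(); it != stopOrder.rend(); ++it)
        std::cout << "stop  " << it->name << '\n';
}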

yhabteab · Mar 17 '25

Thanks, I'll test it out first thing tomorrow. I'll let you know as soon as I can.

w1ll-i-code · Mar 17 '25

Hi @yhabteab. I have tested it and it looks stable to me! Would you like me to include your commit in this PR, or do I have to wait until it has landed?

Sidenote: You need to include climits in the file class_parser.hpp for your commit to compile (at least on my system)

w1ll-i-code · Mar 18 '25

Would you like me to include your commit in this PR or do I have to wait until that landed?

I think it is better to have a separate PR for this; it also simplifies the review process for the colleagues. Hence I have just opened https://github.com/Icinga/icinga2/pull/10385.

Sidenote: You need to include climits in the file class_parser.hpp for your commit to compile (at least on my system)

Thanks for the hint! I didn't get any compile errors on my Mac even without that include, so it seems to be a Linux thing 😅.

yhabteab · Mar 18 '25

Thank you @w1ll-i-code for your efforts and the time you invested in this, but we managed to find a better way to address your issue. This PR is superseded by https://github.com/Icinga/icinga2/pull/10460, therefore I'm closing it.

yhabteab · Jun 16 '25