fix: Reinstall rsyslog package instead of removing it
Cause: During the "Reset original confs" step, the role tried to remove
packages like rsyslog or iproute.
Consequence: Removal may fail because other packages require them (such as dhcp-client requiring iproute, or rsyslog-tls requiring rsyslog).
Fix: Reinstall instead of remove the rsyslog* packages during the reset
step, not other items from the __rsyslog_base_packages variable.
Remove the config directly so that reinstall restores it. For that to
work reliably, we need to purge the unowned config files before
reinstalling, so move that code up -- otherwise starting the service
will fail with colliding options between the default rsyslog.conf and
the old drop-ins.
Result: The role does not remove unrelated packages and works on Fedora ≥ 41 as well.
Note that neither the old nor this new approach work on OSTree -- resetting the config is difficult there and not currently supported.
Signed-off-by: Martin Pitt [email protected]
The test has always failed on F41 since its inception (see #358 and e. g. this log). This is quite a serious bug also on RHEL/CentOS.
[citest]
The C8s failure has the same issue than in #439 -- the logs are 404, so I take it the TF pipeline failed early. But there's no visible link to the TF URL.
After some digging, I found the corresponding workflow run, which has the URL. However, the test is still running, why is it shown as a failure here? https://www.githubstatus.com/ is ok. :thinking: ...
There are some further related failures like this one:
TASK [fedora.linux_system_roles.private_logging_subrole_rsyslog : Reset original confs - logging package is absent] ***
task path: /tmp/collections-lQ4/ansible_collections/fedora/linux_system_roles/roles/private_logging_subrole_rsyslog/tasks/main_core.yml:22
Wednesday 09 April 2025 01:51:07 -0400 (0:00:00.897) 0:00:50.370 *******
fatal: [managed-node1]: FAILED! => {
"changed": false,
"failures": [
"Problem: problem with installed package\n - installed package rsyslog-relp-8.2312.0-5.fc41.x86_64 requires rsyslog = 8.2312.0-5.fc41, but none of the providers can be installed\n - package rsyslog-relp-8.2312.0-5.fc41.x86_64 from fedora requires rsyslog = 8.2312.0-5.fc41, but none of the providers can be installed\n - conflicting requests"
Or the same thing with rsyslog-gnutls here.
I must say the whole "Reset original confs - logging package is absent" is a giant hack. Shouldn't we better set up the config directly, or make a backup of the original config or so?
This is a ratsnest, and I'll wait for the discussion first if we should even support this role on Fedora before I sink more time into it.
This now keeps failing on this bit:
TASK [Print possible errors in /var/log/messages] ******************************
task path: /tmp/collections-Q7A/ansible_collections/fedora/linux_system_roles/tests/logging/tasks/assert_varlogmessages.yml:9
Thursday 10 April 2025 03:54:21 -0400 (0:00:00.567) 0:01:33.120 ********
ok: [managed-node3] => {
"errors": "rsyslogd: error during parsing file /etc/rsyslog.d/00-global.conf, on or before line 12: parameter 'workdirectory' specified more than once - one instance is ignored. Fix config [v8.2102.0-15.el8 try https://www.rsyslog.com/e/2207 ]"
}
TASK [Ensure no errors in /var/log/messages] ***********************************
task path: /tmp/collections-Q7A/ansible_collections/fedora/linux_system_roles/tests/logging/tasks/assert_varlogmessages.yml:19
Thursday 10 April 2025 03:54:21 -0400 (0:00:00.158) 0:01:33.279 ********
fatal: [managed-node3]: FAILED! => {
"assertion": "'rsyslogd: error' not in __default_system_log_content.stdout",
"changed": false,
"evaluated_to": false
}
MSG:
Assertion failed
I can't reproduce this in local tox. I also tried tmt run --until report provision --how virtual --image fedora-41, but that does not work at all locally -- everything just fails, even when reducing to just one managed node. So I'll have to continue to do keyhole debugging via lots of PR pushes, sorry for the noise.
The first debug enabled log shows that at the time the assertion runs, the config is fine -- rsyslog.conf has a single workDirectory, and there is no 00-global.conf. So this is an error from an earlier run. Somehow removing and installing rsyslog does more than reinstalling it.
OTOH, sshing into the test machine at a random time acutally does have workDirectory= in both 00-global.conf and /etc/rsyslog.conf momentarily, so restarting the service (as it happens during reinstall) catches that error. I wonder if it helps to additionally clean up 00-global.conf? Trying.
Doing that at least made all the tox tests pass (also locally) :tada: .
:thinking: this C7 failure looks unrelated, retrying
[citest_bad]
not sure why - but this PR makes centos-7 imuxsock test fail - the last test before this one passed - https://dl.fedoraproject.org/pub/alt/linuxsystemroles/logs/tf_logging-440_CentOS-7-latest-2.9_20250410-130923/artifacts/tests_imuxsock_files-ANSIBLE-2.9-test_playbooks_parallel-FAIL.log is the new failure - https://dl.fedoraproject.org/pub/alt/linuxsystemroles/logs/tf_logging-358_CentOS-7-latest-2.9_20250405-081532/artifacts/tests_imuxsock_files-ANSIBLE-2.9-test_playbooks_parallel-SUCCESS.log is the last passing test
Dang. I thought it was a flake, as the test shows that the "Reset original confs - reinstall logging package" step is skipped (so the change here doesn't affect the run), and the "Install/Update required packages" shows that rsyslog and iproute are already installed. So I retried, but the test took a few hours to run yesterday, and I didn't see the failure any more.
This doesn't reproduce in tox -e qemu-ansible-2.9 -- --image-name centos-7 --log-level=debug tests/tests_imuxsock_files.yml, and running tmt tests locally is broken.
The whole "Restore /dev/log by restarting journald units" cannot work on RHEL 7, and in the previously passing test it was indeed skipped. There, the conditional "Check if /dev/log is present" step was true, i.e. /dev/log existed; that isn't the case in the failing run, there it doesn't exist, i.e. systemd-journald.socket is not running. Ironically that's precisely what the "Restore /dev/log by restarting journald units" step is trying to do, but it doesn't work for RHEL 7.
The corresponding TMT log is useful as it's likely some leftover state change from a previous test. And indeed re-running all these in that order does reproduce it:
tox -e qemu-ansible-2.9 -- --image-name centos-7 --log-level=debug tests/tests_version.yml tests/tests_files_files.yml tests/tests_basics_files.yml tests/tests_version.yml tests/tests_basics_forwards.yml tests/tests_imuxsock_files.yml
bisecting it down:
tox -e qemu-ansible-2.9 -- --image-name centos-7 --log-level=debug tests/tests_basics_forwards.yml tests/tests_imuxsock_files.yml
At the end of t_basiscs_forwards it restarts the units:
RUNNING HANDLER [/var/home/martin/upstream/lsr/logging/tests/roles/linux-system-roles.logging/roles/rsyslog : Restart rsyslogd] ***
TASK [Remove tempdir] **********************************************************
PLAY RECAP *********************************************************************
/var/home/martin/.cache/linux-system-roles/centos-7.qcow2c : ok=471 changed=68 unreachable=0 failed=0 skipped=513 rescued=1 ignored=0
But I added a test -e /dev/log at the start of tests_imuxsock_files.yml and that does pass. So it's not just a race condition between the test (not waiting for the restart to finish), something in _forwards changes the system state in a more sublte way.
I plastered the test with markers like
+ - name: XXX /dev/log is present 2
+ command: test -e /dev/log
and this happens with the second "Force all notified handlers to run" thing. sleeping/polling loop doesn't help. systemd-journald.socket is active, but /dev/log doesn't exist, so something during restarting rsyslog kills it.
systemctl restart systemd-journald.socket brings it back. systemctl restart rsyslog doesn't remove it again.
So I think that's the whole raison d'être of commit 2671779e8353642. Let's just fix it for RHEL 7, shall we?
[citest]
@richm all green now :green_heart:
[citest]
Thinking about the purge code some more, and remembering why we did this the way we did (it's been years) - I should have marked this code with HERE BE DRAGONS because it is not obvious. One of the reasons was to try to make the code as idempotent as possible - do not do any changes which are unnecessary - this new code will report a lot of changes that may be unnecessary, which changes the behavior that users will see. Part of that is the task name: Purge - remove files not generated by current state - one of the use cases for purge is that you specify the new config and tell the role that you want to wipe out the existing config. Well, what if you specify the exact same config that you are wiping out? That is essentially a no-op, and the role should report "changed": false. The old code tries to do that as much as possible.
So I think we need to take a step back here - fix the imuxsock issue on el7 which is pretty straightforward - and think about how to fix the purge issue in a separate PR.
One of the problems is that we do not have a good way to test idempotency. Ideally Ansible would expose some sort of "changed" counter so that you could run the role and ask "how many times was "changed": true reported" but there is no such mechanism. The only way to do it is to run ansible-playbook as a command or shell from within another playbook, capture the output, and look at the changed=N number. In order to fully develop a fix for this, we may have to add some extra variables that register: the various tasks that we are interested in, and check those in the test playbook.
This PR doesn't fundamentally change the approach -- if logging_purge_confs is false, nothing changes, as both the old and new code are conditionalized on logging_purge_confs == true. The changes are:
-
instead of removing and installing the packages, it uses an explicit "reinstall". The current main code is demonstrably broken -- on older releases it removes unrelated packages like iproute and reverse dependencies like dhcp-client, which can be disastrous. On newer releases
packages:doesn't remove a package if there area reverse deps, and the rule fails. (See the current failures). -
It moves the deletion of non-package config files earlier, so that after reinstall there is a valid confguration. It is still a mystery to me how the old code could get away with doing this so late. After removal and new install there was a time when rsyslog.conf was restored to default, but rsyslog.d/ was still old, which is often invalid and leads to a service startup failure. The QEMU tox tests fail left and right due to this, and I have no idea how the tmt tests manage to evade that.
I'm not proud of this hackery here, but it a least improves these things a lot. I don't see how to make this "perfect", it's a giant hack which necessitates some compromises.
Well, what if you specify the exact same config that you are wiping out? That is essentially a no-op, and the role should report "changed": false
This is not possible in principle, if you also specify logging_purge_confs: true. As we don't know the original config in advance, the rule has to remove and reinstall the package (either in two steps or in one, so both in main and this PR), which is fundamentally and necessarily changing the system.
After fighting with this for a week, this is my best shot at this problem. If it's not good enough, I admit defeat :white_flag: , and my only remaining proposal is to stop supporting logging_purge_confs on non-bootc systems.
This PR doesn't fundamentally change the approach
It depends on what you mean by "fundamentally". The new code will always remove /etc/rsyslog.conf if logging_purge_confs is true, even if it doesn't need to https://github.com/linux-system-roles/logging/pull/440/files#diff-86218b7c9c831a3bdd5513a24d268f2701a9d814094c4dd9b6219e58d8a04d92R8
The old code will first check if /etc/rsyslog.conf has been modified - https://github.com/linux-system-roles/logging/blob/main/roles/rsyslog/tasks/main_core.yml#L16 and will not remove the packages if /etc/rsyslog.conf is unchanged - https://github.com/linux-system-roles/logging/blob/main/roles/rsyslog/tasks/main_core.yml#L29
This is what makes the old code idempotent - do not change something if it is not necessary.
-- if
logging_purge_confsis false, nothing changes, as both the old and new code are conditionalized onlogging_purge_confs == true. The changes are:1. instead of removing and installing the packages, it uses an explicit "reinstall". The current main code is demonstrably broken -- on older releases it removes unrelated packages like iproute and reverse dependencies like dhcp-client, which can be disastrous. On newer releases `packages:` doesn't remove a package if there area reverse deps, and the rule fails. (See the current failures).
Agreed. This definitely needs to be fixed - only touch the packages that are absolutely necessary to be touched.
2. It moves the deletion of non-package config files earlier, so that after reinstall there is a valid confguration. It is still a mystery to me how the old code could get away with doing this so late.
It took a lot of time spent by Noriko and I to figure out how to do this in a way that was idempotent. And it does work.
After removal and new install there was a time when rsyslog.conf was restored to default, but rsyslog.d/ was still old, which is often invalid and leads to a service startup failure. The QEMU tox tests fail left and right due to this, and I have no idea how the tmt tests manage to evade that.
I'm not proud of this hackery here, but it a least improves these things a lot. I don't see how to make this "perfect", it's a giant hack which necessitates some compromises.
Well, what if you specify the exact same config that you are wiping out? That is essentially a no-op, and the role should report "changed": false
This is not possible in principle, if you also specify
logging_purge_confs: true. As we don't know the original config in advance, the rule has to remove and reinstall the package (either in two steps or in one, so both in main and this PR), which is fundamentally and necessarily changing the system.After fighting with this for a week, this is my best shot at this problem. If it's not good enough, I admit defeat 🏳️ , and my only remaining proposal is to stop supporting
logging_purge_confson non-bootc systems.
This PR doesn't fundamentally change the approach -- if
logging_purge_confsis false, nothing changes, as both the old and new code are conditionalized onlogging_purge_confs == true. The changes are:1. instead of removing and installing the packages, it uses an explicit "reinstall". The current main code is demonstrably broken -- on older releases it removes unrelated packages like iproute and reverse dependencies like dhcp-client, which can be disastrous. On newer releases `packages:` doesn't remove a package if there area reverse deps, and the rule fails. (See the current failures). 2. It moves the deletion of non-package config files earlier, so that after reinstall there is a valid confguration. It is still a mystery to me how the old code could get away with doing this so late. After removal and new install there was a time when rsyslog.conf was restored to default, but rsyslog.d/ was still old, which is often invalid and leads to a service startup failure. The QEMU tox tests fail left and right due to this, and I have no idea how the tmt tests manage to evade that.
The tests fail on fedora only? Because https://github.com/linux-system-roles/logging/pull/358 shows all of the EL tests passing.
I'm not proud of this hackery here, but it a least improves these things a lot. I don't see how to make this "perfect", it's a giant hack which necessitates some compromises.
I really don't want to make this less idempotent than it already is.
What you could do is to run https://github.com/linux-system-roles/logging/blob/main/tests/tests_purge_reset.yml with the current main HEAD code and see where "changed": true happens (on EL10, instead of fedora, just to use a working platform), then run the same test with your proposed changes, and compare the number of "changed": true.
Well, what if you specify the exact same config that you are wiping out? That is essentially a no-op, and the role should report "changed": false
This is not possible in principle, if you also specify
logging_purge_confs: true. As we don't know the original config in advance, the rule has to remove and reinstall the package (either in two steps or in one, so both in main and this PR), which is fundamentally and necessarily changing the system.
It is possible to know if /etc/rsyslog.conf is changed, using rpm -V, and to skip removing and re-installing the package if there are no changes.
After fighting with this for a week, this is my best shot at this problem. If it's not good enough, I admit defeat 🏳️ , and my only remaining proposal is to stop supporting
logging_purge_confson non-bootc systems.
Either use your proposed fix and break idempotency, or drop support for an important feature even on EL where it works? Are those my only two options?
It is possible to know if /etc/rsyslog.conf is changed, using rpm -V, and to skip removing and re-installing the package if there are no changes.
Ah, sorry, I completely misunderstood you earlier what you meant with "avoid changes". I can look at that again. I had that as the first version, but it breaks a lot of tests still, but now I can interpret the logs better. I wonder about some things there, e.g. it skips the "Reinstall package to restore rsyslog config if purging" step (the rpm -V check if logging_enabled is false, but this directly contradicts the documentation:
If you want to remove all the configuration files previously configured, in addition to setting
state: absentto each logging_inputs and logging_outputs item, addlogging_enabled: falseto the configuration variables as follows. It will eliminate the global and common configuration files, as well.
Changing that makes the tests a bit greener (it fails later), but I don't yet understand how this can pass currently.
The tests fail on fedora only? Because #358 shows all of the EL tests passing.
Yes, apparently removing a package with ansible/dnf5 changed behaviour quite a bit. The F41 failure there is what started this whole odyssey, but that failure is even earlier than what I run into with the above.
overall - looking much better - since we don't have a good way to test idempotency, you'll need to run the tests_purge_reset.yml test with the old code, then the new code, and examine the tasks with "changed": true
Thanks for the hints @richm ! TBH this isn't ready yet, it was more of a "I have something that makes tests_purge_reset.yml pass, and I want the bots' opinion" (plus, I needed to run out). I'll massage this some more.
[citest]
This new version works with the tox batch tests (see #439), but it fails with the serial tests in tmt. That is is "interesting". It assumes that this:
- name: END TEST CASE 2; Clean up the deployed config
vars:
logging_purge_confs: true
logging_inputs: []
logging_outputs: []
logging_flows: []
include_role:
name: linux-system-roles.logging
removes the rsyslog-elasticsearch package and removes the configuration:
TASK [/var/home/martin/upstream/lsr/logging/tests/roles/linux-system-roles.logging/roles/rsyslog : Reset original confs - logging package is absent] ***
task path: /var/home/martin/upstream/lsr/logging/roles/rsyslog/tasks/main_core.yml:22
Wednesday 16 April 2025 08:16:52 +0200 (0:00:00.858) 0:02:04.826 *******
changed: [/var/home/martin/.cache/linux-system-roles/centos-10.qcow2] => {
"changed": true,
"rc": 0,
"results": [
"Removed: rsyslog-8.2412.0-1.el10.x86_64",
"Removed: rsyslog-elasticsearch-8.2412.0-1.el10.x86_64",
"Removed: iproute-6.11.0-1.el10.x86_64"
]
}
This was a side effect of the config file restoration, not an explicit act. With dnf reinstall, reverse dependencies are not removed automatically any more.
Also, the current main branch leaves the system in an bad configuration state: /etc/rsyslog.conf was reset to the original version, but /etc/rsyslog.d/ still contains the role managed 10-output-files-modules.conf and 30-output-files-default_files.conf.
It reproduces locally with
tox -e qemu-ansible-core-2.17 -- --image-name centos-10 --log-level=debug tests/tests_files_elasticsearch.yml tests/tests_remote.yml
I learned my lesson, this remove behaviour is (1) dug too deeply into all the assumptions here, and (2) existing users may rely on it now. So I give up trying to clean this up further, and I just changed this to the minimal fix.
Ah, right, the other problem: package: state: absent does not remove reverse dependencies any more with dnf5. (same in tox/gh). Added as separate commit.
[citest]
This new version works with the tox batch tests (see #439), but it fails with the serial tests in tmt. That is is "interesting". It assumes that this:
- name: END TEST CASE 2; Clean up the deployed config vars: logging_purge_confs: true logging_inputs: [] logging_outputs: [] logging_flows: [] include_role: name: linux-system-roles.loggingremoves the rsyslog-elasticsearch package and removes the configuration:
TASK [/var/home/martin/upstream/lsr/logging/tests/roles/linux-system-roles.logging/roles/rsyslog : Reset original confs - logging package is absent] *** task path: /var/home/martin/upstream/lsr/logging/roles/rsyslog/tasks/main_core.yml:22 Wednesday 16 April 2025 08:16:52 +0200 (0:00:00.858) 0:02:04.826 ******* changed: [/var/home/martin/.cache/linux-system-roles/centos-10.qcow2] => { "changed": true, "rc": 0, "results": [ "Removed: rsyslog-8.2412.0-1.el10.x86_64", "Removed: rsyslog-elasticsearch-8.2412.0-1.el10.x86_64", "Removed: iproute-6.11.0-1.el10.x86_64" ] }This was a side effect of the config file restoration, not an explicit act. With
dnf reinstall, reverse dependencies are not removed automatically any more.Also, the current main branch leaves the system in an bad configuration state: /etc/rsyslog.conf was reset to the original version, but /etc/rsyslog.d/ still contains the role managed 10-output-files-modules.conf and 30-output-files-default_files.conf.
It reproduces locally with
tox -e qemu-ansible-core-2.17 -- --image-name centos-10 --log-level=debug tests/tests_files_elasticsearch.yml tests/tests_remote.yml
The tests aren't really designed to be run that way, because the tests assume they can use set_fact and "pollute" the global variable namespace, and that will be reset by the subsequent ansible-playbook invocation. If you run the tests sequentially like the above, then the global variables set by tests/tests_files_elasticsearch.yml will still be in the global namespace when tests/tests_remote.yml.
That being said, not sure if the behavior you are seeing is a result of global variables already being set, or something else. If you could reproduce it with --batch-file batch.txt with tests/tests_files_elasticsearch.yml and tests/tests_remote.yml being the files in batch.txt, that would be more indicative of a real error in the tests and possibly in the code that would need to be fixed.
I learned my lesson, this remove behaviour is (1) dug too deeply into all the assumptions here, and (2) existing users may rely on it now. So I give up trying to clean this up further, and I just changed this to the minimal fix.
I learned my lesson, this remove behaviour is (1) dug too deeply into all the assumptions here, and (2) existing users may rely on it now. So I give up trying to clean this up further, and I just changed this to the minimal fix.
I think it is worth revisiting this issue later - figuring out a better way to support purge mode and still be idempotent - maybe when you get to the image mode work.
The tests aren't really designed to be run that way
Yes I know, but it's a convenient way to reproduce the TMT errors, which run in a similar manner. Creating a batch usually works as well (I didn't try that here). The actual error was real
That being said, not sure if the behavior you are seeing is a result of global variables already being set,
No, not at all. The error was because the VM isn't reset between tests, and thus both the installed rsyslog-elasticsearch as well as some rsyslog.d/* files are still around after the elasticsearch test finishes.
Rebased to current main, to pick up the QEMU tests.
[citest]
[citest]