uyuni
uyuni copied to clipboard
Testsuite: test restart of postgres
What does this PR change?
This test restarts postgres service and verifies, that it is handled correctly in java and salt - applying highstate must work.
Links
This should catch the bugs discussed in https://github.com/SUSE/spacewalk/issues/20143
- [x] DONE
Changelogs
Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository
If you don't need a changelog check, please mark this checkbox:
- [x] No changelog needed
If you uncheck the checkbox after the PR is created, you will need to re-run changelog_test (see below)
Re-run a test
If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:
- [ ] Re-run test "changelog_test"
- [ ] Re-run test "backend_unittests_pgsql"
- [ ] Re-run test "java_pgsql_tests"
- [ ] Re-run test "schema_migration_test_pgsql"
- [ ] Re-run test "susemanager_unittests"
- [ ] Re-run test "javascript_lint"
- [ ] Re-run test "spacecmd_unittests"
This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 10 days.
Rebase done.
The scenario is definitely too fast:
Feature: Restart the PostgreSQL service
Scenario: Restart the PostgreSQL database service # features/secondary/srv_restart_postgres.feature:6
This scenario ran at: 2023-10-25 15:45:58 +0200
Given I am authorized for the "Admin" section # features/step_definitions/navigation_steps.rb:425
systemctl restart postgresql.service returned status code = 0.
Output:
''
When I restart the "postgresql.service" service on "server" # features/step_definitions/command_steps.rb:656
And I wait until "postgresql" service is active on "server" # features/step_definitions/command_steps.rb:240
Initializing a twopence node for 'sle_minion'.
Host 'sle_minion' is alive with determined hostname dominik-min-sles and FQDN dominik-min-sles.mgr.suse.de
Node: dominik-min-sles, OS Version: 15-SP4, Family: sles
And I am on the Systems overview page of this "sle_minion" # features/step_definitions/navigation_steps.rb:437
And I follow "States" in the content area # features/step_definitions/navigation_steps.rb:326
And I click on "Apply Highstate" # features/step_definitions/navigation_steps.rb:283
Then I should see a "Applying the highstate has been scheduled." text # features/step_definitions/navigation_steps.rb:613
Timeout after 90 seconds (Timeout.timeout): I still see text 'Apply highstate scheduled by admin'
And I wait until event "Apply highstate scheduled by admin" is completed # features/step_definitions/common_steps.rb:135
execution expired (Timeout::Error)
./features/step_definitions/navigation_steps.rb:75:in `block (2 levels) in <top (required)>'
./features/support/commonlib.rb:94:in `block in repeat_until_timeout'
./features/support/commonlib.rb:83:in `repeat_until_timeout'
./features/step_definitions/navigation_steps.rb:72:in `/^I wait at most (\d+) seconds until I do not see "([^"]*)" text, refreshing the page$/'
./features/step_definitions/common_steps.rb:142:in `/^I wait (\d+) seconds until the event is picked up and (\d+) seconds until the event "([^"]*)" is completed$/'
./features/step_definitions/common_steps.rb:159:in `/^I wait at most (\d+) seconds until event "([^"]*)" is completed$/'
./features/step_definitions/common_steps.rb:136:in `/^I wait until event "([^"]*)" is completed$/'
features/secondary/srv_restart_postgres.feature:14:in `I wait until event "Apply highstate scheduled by admin" is completed'
=> /var/log/rhn/rhn_web_ui.log
:wave: Hello! Thanks for contributing to our project. Acceptance tests will take some time (aprox. 1h), please be patient :coffee: You can see the progress at the end of this page and at https://github.com/uyuni-project/uyuni/pull/6539/checks Once tests finish, if they fail, you can check :eyes: the cucumber report. See the link at the output of the action. You can also check the artifacts section, which contains the logs at https://github.com/uyuni-project/uyuni/pull/6539/checks.
If you are unsure the failing tests are related to your code, you can check the "reference jobs". These are jobs that run on a scheduled time with code from master. If they fail for the same reason as your build, it means the tests or the infrastructure are broken. If they do not fail, but yours do, it means it is related to your code.
Reference tests:
-
https://github.com/uyuni-project/uyuni/actions/workflows/acceptance_tests_secondary_parallel.yml?query=event%3Aschedule
-
https://github.com/uyuni-project/uyuni/actions/workflows/acceptance_tests_secondary.yml?query=event%3Aschedule
KNOWN ISSUES
Sometimes the build can fail when pulling new jar files from download.opensuse.org . This is a known limitation. Given this happens rarely, when it does, all you need to do is rerun the test. Sorry for the inconvenience.
For more tips on troubleshooting, see the troubleshooting guide.
Happy hacking! :warning: You should not merge if acceptance tests fail to pass. :warning:
I created a step for checking the taskomatic log as Oscar suggested.
Scenario: Restart the PostgreSQL database service
Given I am authorized for the "Admin" section
When I restart the "postgresql.service" service on "server"
And I wait until "postgresql" service is active on "server"
And I am on the Systems overview page of this "sle_minion"
And I follow "States" in the content area
And I click on "Apply Highstate"
Then I should see a "Applying the highstate has been scheduled." text
And I wait until I see "MinionActionExecutor" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server" # <------ new
And I wait until event "Apply highstate scheduled by admin" is completed
When(/^I wait until I see "([^"]*)" in file "([^"]*)" on "([^"]*)"$/) do |text, file, host|
node = get_target(host)
repeat_until_timeout(message: "Entry #{text} in file #{file} on #{host} not found") do
_output, code = node.run("tail -n 10 #{file} | grep '#{text}' ", check_errors: false, verbose: true)
break if code.zero?
end
end
(...)
tail -n 10 /var/log/rhn/rhn_taskomatic_daemon.log | grep 'MinionActionExecutor' returned status code = 1.
Output:
''
And I wait until I see "MinionActionExecutor" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server" # features/step_definitions/command_steps.rb:1626
Timeout after 250 seconds (repeat_until_timeout): Entry MinionActionExecutor in file /var/log/rhn/rhn_taskomatic_daemon.log on server not found (RuntimeError)
./features/support/commonlib.rb:100:in `block in repeat_until_timeout'
./features/support/commonlib.rb:83:in `repeat_until_timeout'
./features/step_definitions/command_steps.rb:1628:in `/^I wait until I see "([^"]*)" in file "([^"]*)" on "([^"]*)"$/'
features/secondary/srv_restart_postgres.feature:15:in `I wait until I see "MinionActionExecutor" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server"'
And I wait until event "Apply highstate scheduled by admin" is completed # features/step_definitions/common_steps.rb:135
=> /var/log/rhn/rhn_web_ui.log
However, when we do not wait a certain amount of time before we execute the high state on the minion, this event will not get picked up, and therefore I do not see what I am looking for in the taskomatic log file. So when applying highstate while taskomatik is not fully reconnected, it does not trigger the high state and leaves it in the pending queue.
Any suggestions or comments, @srbarrios?
If we leave the timeout in, the scenario succeeds:
dominik-ctl:~/spacewalk/testsuite # cucumber features/secondary/srv_restart_postgres.feature
(..)
Scenario: Restart the PostgreSQL database service # features/secondary/srv_restart_postgres.feature:6
This scenario ran at: 2023-10-26 15:12:58 +0200
Given I am authorized for the "Admin" section # features/step_definitions/navigation_steps.rb:425
systemctl restart postgresql.service returned status code = 0.
Output:
''
When I restart the "postgresql.service" service on "server" # features/step_definitions/command_steps.rb:656
And I wait until "postgresql" service is active on "server" # features/step_definitions/command_steps.rb:240
And I wait for "60" seconds # features/step_definitions/common_steps.rb:19
Initializing a twopence node for 'sle_minion'.
Host 'sle_minion' is alive with determined hostname dominik-min-sles and FQDN dominik-min-sles.mgr.suse.de
Node: dominik-min-sles, OS Version: 15-SP4, Family: sles
And I am on the Systems overview page of this "sle_minion" # features/step_definitions/navigation_steps.rb:437
And I follow "States" in the content area # features/step_definitions/navigation_steps.rb:326
And I click on "Apply Highstate" # features/step_definitions/navigation_steps.rb:283
Then I should see a "Applying the highstate has been scheduled." text # features/step_definitions/navigation_steps.rb:613
tail -n 10 /var/log/rhn/rhn_taskomatic_daemon.log | grep 'MinionActionExecutor' returned status code = 0.
Output:
'2023-10-26 15:14:14,090 [DefaultQuartzScheduler_Worker-6] INFO com.redhat.rhn.taskomatic.task.MinionActionExecutor - Executing action: 149
'
And I wait until I see "MinionActionExecutor" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server" # features/step_definitions/command_steps.rb:1626
And I wait until event "Apply highstate scheduled by admin" is completed # features/step_definitions/common_steps.rb:135
This scenario took: 114 seconds
1 scenario (1 passed)
10 steps (10 passed)
1m54.133s
Any suggestions or comments, @srbarrios?
Sorry took me a while to jump here. Yeah, so I proposed to monitor the taskomatic log at another point of time in fact. SO, something like:
Scenario: Restart the PostgreSQL database service
Given I am authorized for the "Admin" section
When I restart the "postgresql.service" service on "server"
And I wait until "postgresql" service is active on "server"
And I wait until I see "xxxxx" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server"
And I am on the Systems overview page of this "sle_minion"
And I follow "States" in the content area
And I click on "Apply Highstate"
Then I should see a "Applying the highstate has been scheduled." text
And I wait until event "Apply highstate scheduled by admin" is completed
I tried to identify the text we need to look for by navigating into our product code, looking for tasko, quartz and postgresql words, but I couldn't find it. I would say we even need to add a new log to the Java side in order to properly monitor this reconnection from the database.
But... on python code, I found this debug log entry that can help with this monitoring:
https://github.com/uyuni-project/uyuni/blob/master/susemanager-utils/susemanager-sls/modules/pillar/suma_minion.py#L125
My guess is that's printed on salt-master logs
/var/log/salt/master also does not contain anything useful, we only the error messages.
dominik-srv:/ # tail /var/log/salt/master
2023-10-26 14:14:00,525 [salt.loaded.ext.engines.mgr_events:147 ][ERROR ][27280] salt.loaded.ext.engines.mgr_events: SSL connection has been closed unexpectedly
2023-10-26 14:14:00,525 [salt.loaded.ext.engines.mgr_events:151 ][ERROR ][27280] salt.loaded.ext.engines.mgr_events: Error commiting: connection already closed
2023-10-26 14:14:00,526 [salt.loaded.ext.engines.mgr_events:188 ][ERROR ][27280] salt.loaded.ext.engines.mgr_events: Diconnected from database. Trying to reconnect...
2023-10-26 14:14:00,773 [salt.loaded.ext.pillar.suma_minion:150 ][ERROR ][27293] Error on getting database pillar, trying again: ('SSL connection has been closed unexpectedly\n',)
2023-10-26 15:14:14,467 [salt.loaded.ext.engines.mgr_events:147 ][ERROR ][27280] salt.loaded.ext.engines.mgr_events: SSL connection has been closed unexpectedly
2023-10-26 15:14:14,467 [salt.loaded.ext.engines.mgr_events:151 ][ERROR ][27280] salt.loaded.ext.engines.mgr_events: Error commiting: connection already closed
2023-10-26 15:14:14,469 [salt.loaded.ext.engines.mgr_events:188 ][ERROR ][27280] salt.loaded.ext.engines.mgr_events: Diconnected from database. Trying to reconnect...
2023-10-26 15:14:14,717 [salt.loaded.ext.pillar.suma_minion:150 ][ERROR ][27285] Error on getting database pillar, trying again: ('SSL connection has been closed unexpectedly\n',)
We concluded in Slack that there is no easy solution for this right now, so I added the initial waiting time from Vladimir back but also introduced a step to check for the triggered high state event in the Taskomatik log file.
We need to disable this test for now, according to mc. It did cause too much issues in 4.3 where it already was merged. I will open a bug report and adjust the PR to comment the test out in the .yml file.
@raulillo82 This can also be merged at some point to align the branches with 4.3, too. It is not exactly a test suite fix, though. I did rebase the branch again.
Rebase done. Once the tests are finished, I will merge the PR.