uyuni icon indicating copy to clipboard operation
uyuni copied to clipboard

Testsuite: test restart of postgres

Open nadvornik opened this issue 2 years ago • 10 comments
trafficstars

What does this PR change?

This test restarts postgres service and verifies, that it is handled correctly in java and salt - applying highstate must work.

Links

This should catch the bugs discussed in https://github.com/SUSE/spacewalk/issues/20143

  • [x] DONE

Changelogs

Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository

If you don't need a changelog check, please mark this checkbox:

  • [x] No changelog needed

If you uncheck the checkbox after the PR is created, you will need to re-run changelog_test (see below)

Re-run a test

If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:

  • [ ] Re-run test "changelog_test"
  • [ ] Re-run test "backend_unittests_pgsql"
  • [ ] Re-run test "java_pgsql_tests"
  • [ ] Re-run test "schema_migration_test_pgsql"
  • [ ] Re-run test "susemanager_unittests"
  • [ ] Re-run test "javascript_lint"
  • [ ] Re-run test "spacecmd_unittests"

nadvornik avatar Jan 27 '23 14:01 nadvornik

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Oct 25 '23 01:10 github-actions[bot]

Rebase done.

nodeg avatar Oct 25 '23 13:10 nodeg

The scenario is definitely too fast:

Feature: Restart the PostgreSQL service

  Scenario: Restart the PostgreSQL database service                          # features/secondary/srv_restart_postgres.feature:6
      This scenario ran at: 2023-10-25 15:45:58 +0200
    Given I am authorized for the "Admin" section                            # features/step_definitions/navigation_steps.rb:425
systemctl restart postgresql.service returned status code = 0.
Output:
''
    When I restart the "postgresql.service" service on "server"              # features/step_definitions/command_steps.rb:656
    And I wait until "postgresql" service is active on "server"              # features/step_definitions/command_steps.rb:240
Initializing a twopence node for 'sle_minion'.
Host 'sle_minion' is alive with determined hostname dominik-min-sles and FQDN dominik-min-sles.mgr.suse.de
Node: dominik-min-sles, OS Version: 15-SP4, Family: sles
    And I am on the Systems overview page of this "sle_minion"               # features/step_definitions/navigation_steps.rb:437
    And I follow "States" in the content area                                # features/step_definitions/navigation_steps.rb:326
    And I click on "Apply Highstate"                                         # features/step_definitions/navigation_steps.rb:283
    Then I should see a "Applying the highstate has been scheduled." text    # features/step_definitions/navigation_steps.rb:613
Timeout after 90 seconds (Timeout.timeout): I still see text 'Apply highstate scheduled by admin'
    And I wait until event "Apply highstate scheduled by admin" is completed # features/step_definitions/common_steps.rb:135
      execution expired (Timeout::Error)
      ./features/step_definitions/navigation_steps.rb:75:in `block (2 levels) in <top (required)>'
      ./features/support/commonlib.rb:94:in `block in repeat_until_timeout'
      ./features/support/commonlib.rb:83:in `repeat_until_timeout'
      ./features/step_definitions/navigation_steps.rb:72:in `/^I wait at most (\d+) seconds until I do not see "([^"]*)" text, refreshing the page$/'
      ./features/step_definitions/common_steps.rb:142:in `/^I wait (\d+) seconds until the event is picked up and (\d+) seconds until the event "([^"]*)" is completed$/'
      ./features/step_definitions/common_steps.rb:159:in `/^I wait at most (\d+) seconds until event "([^"]*)" is completed$/'
      ./features/step_definitions/common_steps.rb:136:in `/^I wait until event "([^"]*)" is completed$/'
      features/secondary/srv_restart_postgres.feature:14:in `I wait until event "Apply highstate scheduled by admin" is completed'
=> /var/log/rhn/rhn_web_ui.log

nodeg avatar Oct 25 '23 13:10 nodeg

:wave: Hello! Thanks for contributing to our project. Acceptance tests will take some time (aprox. 1h), please be patient :coffee: You can see the progress at the end of this page and at https://github.com/uyuni-project/uyuni/pull/6539/checks Once tests finish, if they fail, you can check :eyes: the cucumber report. See the link at the output of the action. You can also check the artifacts section, which contains the logs at https://github.com/uyuni-project/uyuni/pull/6539/checks.

If you are unsure the failing tests are related to your code, you can check the "reference jobs". These are jobs that run on a scheduled time with code from master. If they fail for the same reason as your build, it means the tests or the infrastructure are broken. If they do not fail, but yours do, it means it is related to your code.

Reference tests:

  • https://github.com/uyuni-project/uyuni/actions/workflows/acceptance_tests_secondary_parallel.yml?query=event%3Aschedule

  • https://github.com/uyuni-project/uyuni/actions/workflows/acceptance_tests_secondary.yml?query=event%3Aschedule

KNOWN ISSUES

Sometimes the build can fail when pulling new jar files from download.opensuse.org . This is a known limitation. Given this happens rarely, when it does, all you need to do is rerun the test. Sorry for the inconvenience.

For more tips on troubleshooting, see the troubleshooting guide.

Happy hacking! :warning: You should not merge if acceptance tests fail to pass. :warning:

github-actions[bot] avatar Oct 25 '23 14:10 github-actions[bot]

I created a step for checking the taskomatic log as Oscar suggested.

  Scenario: Restart the PostgreSQL database service
    Given I am authorized for the "Admin" section
    When I restart the "postgresql.service" service on "server"
    And I wait until "postgresql" service is active on "server"
    And I am on the Systems overview page of this "sle_minion"
    And I follow "States" in the content area
    And I click on "Apply Highstate"
    Then I should see a "Applying the highstate has been scheduled." text
    And I wait until I see "MinionActionExecutor" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server"  # <------ new
    And I wait until event "Apply highstate scheduled by admin" is completed
When(/^I wait until I see "([^"]*)" in file "([^"]*)" on "([^"]*)"$/) do |text, file, host|
  node = get_target(host)
  repeat_until_timeout(message: "Entry #{text} in file #{file} on #{host} not found") do
    _output, code = node.run("tail -n 10 #{file} | grep '#{text}' ", check_errors: false, verbose: true)
    break if code.zero?
  end
end
(...)
tail -n 10 /var/log/rhn/rhn_taskomatic_daemon.log | grep 'MinionActionExecutor'  returned status code = 1.
Output:
''
    And I wait until I see "MinionActionExecutor" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server" # features/step_definitions/command_steps.rb:1626
      Timeout after 250 seconds (repeat_until_timeout): Entry MinionActionExecutor in file /var/log/rhn/rhn_taskomatic_daemon.log on server not found (RuntimeError)
      ./features/support/commonlib.rb:100:in `block in repeat_until_timeout'
      ./features/support/commonlib.rb:83:in `repeat_until_timeout'
      ./features/step_definitions/command_steps.rb:1628:in `/^I wait until I see "([^"]*)" in file "([^"]*)" on "([^"]*)"$/'
      features/secondary/srv_restart_postgres.feature:15:in `I wait until I see "MinionActionExecutor" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server"'
    And I wait until event "Apply highstate scheduled by admin" is completed                                   # features/step_definitions/common_steps.rb:135
=> /var/log/rhn/rhn_web_ui.log

However, when we do not wait a certain amount of time before we execute the high state on the minion, this event will not get picked up, and therefore I do not see what I am looking for in the taskomatic log file. So when applying highstate while taskomatik is not fully reconnected, it does not trigger the high state and leaves it in the pending queue.

image

Any suggestions or comments, @srbarrios?

nodeg avatar Oct 26 '23 12:10 nodeg

If we leave the timeout in, the scenario succeeds:

dominik-ctl:~/spacewalk/testsuite # cucumber features/secondary/srv_restart_postgres.feature
(..)

  Scenario: Restart the PostgreSQL database service                                                            # features/secondary/srv_restart_postgres.feature:6
      This scenario ran at: 2023-10-26 15:12:58 +0200
    Given I am authorized for the "Admin" section                                                              # features/step_definitions/navigation_steps.rb:425
systemctl restart postgresql.service returned status code = 0.
Output:
''
    When I restart the "postgresql.service" service on "server"                                                # features/step_definitions/command_steps.rb:656
    And I wait until "postgresql" service is active on "server"                                                # features/step_definitions/command_steps.rb:240
    And I wait for "60" seconds                                                                                # features/step_definitions/common_steps.rb:19
Initializing a twopence node for 'sle_minion'.
Host 'sle_minion' is alive with determined hostname dominik-min-sles and FQDN dominik-min-sles.mgr.suse.de
Node: dominik-min-sles, OS Version: 15-SP4, Family: sles
    And I am on the Systems overview page of this "sle_minion"                                                 # features/step_definitions/navigation_steps.rb:437
    And I follow "States" in the content area                                                                  # features/step_definitions/navigation_steps.rb:326
    And I click on "Apply Highstate"                                                                           # features/step_definitions/navigation_steps.rb:283
    Then I should see a "Applying the highstate has been scheduled." text                                      # features/step_definitions/navigation_steps.rb:613
tail -n 10 /var/log/rhn/rhn_taskomatic_daemon.log | grep 'MinionActionExecutor'  returned status code = 0.
Output:
'2023-10-26 15:14:14,090 [DefaultQuartzScheduler_Worker-6] INFO  com.redhat.rhn.taskomatic.task.MinionActionExecutor - Executing action: 149
'
    And I wait until I see "MinionActionExecutor" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server" # features/step_definitions/command_steps.rb:1626
    And I wait until event "Apply highstate scheduled by admin" is completed                                   # features/step_definitions/common_steps.rb:135
      This scenario took: 114 seconds

1 scenario (1 passed)
10 steps (10 passed)
1m54.133s

nodeg avatar Oct 26 '23 13:10 nodeg

Any suggestions or comments, @srbarrios?

Sorry took me a while to jump here. Yeah, so I proposed to monitor the taskomatic log at another point of time in fact. SO, something like:

  Scenario: Restart the PostgreSQL database service
    Given I am authorized for the "Admin" section
    When I restart the "postgresql.service" service on "server"
    And I wait until "postgresql" service is active on "server"
    And I wait until I see "xxxxx" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server"
    And I am on the Systems overview page of this "sle_minion"
    And I follow "States" in the content area
    And I click on "Apply Highstate"
    Then I should see a "Applying the highstate has been scheduled." text
    And I wait until event "Apply highstate scheduled by admin" is completed

I tried to identify the text we need to look for by navigating into our product code, looking for tasko, quartz and postgresql words, but I couldn't find it. I would say we even need to add a new log to the Java side in order to properly monitor this reconnection from the database.

But... on python code, I found this debug log entry that can help with this monitoring:

https://github.com/uyuni-project/uyuni/blob/master/susemanager-utils/susemanager-sls/modules/pillar/suma_minion.py#L125

My guess is that's printed on salt-master logs

srbarrios avatar Oct 26 '23 14:10 srbarrios

/var/log/salt/master also does not contain anything useful, we only the error messages.

dominik-srv:/ # tail /var/log/salt/master
2023-10-26 14:14:00,525 [salt.loaded.ext.engines.mgr_events:147 ][ERROR   ][27280] salt.loaded.ext.engines.mgr_events: SSL connection has been closed unexpectedly

2023-10-26 14:14:00,525 [salt.loaded.ext.engines.mgr_events:151 ][ERROR   ][27280] salt.loaded.ext.engines.mgr_events: Error commiting: connection already closed
2023-10-26 14:14:00,526 [salt.loaded.ext.engines.mgr_events:188 ][ERROR   ][27280] salt.loaded.ext.engines.mgr_events: Diconnected from database. Trying to reconnect...
2023-10-26 14:14:00,773 [salt.loaded.ext.pillar.suma_minion:150 ][ERROR   ][27293] Error on getting database pillar, trying again: ('SSL connection has been closed unexpectedly\n',)
2023-10-26 15:14:14,467 [salt.loaded.ext.engines.mgr_events:147 ][ERROR   ][27280] salt.loaded.ext.engines.mgr_events: SSL connection has been closed unexpectedly

2023-10-26 15:14:14,467 [salt.loaded.ext.engines.mgr_events:151 ][ERROR   ][27280] salt.loaded.ext.engines.mgr_events: Error commiting: connection already closed
2023-10-26 15:14:14,469 [salt.loaded.ext.engines.mgr_events:188 ][ERROR   ][27280] salt.loaded.ext.engines.mgr_events: Diconnected from database. Trying to reconnect...
2023-10-26 15:14:14,717 [salt.loaded.ext.pillar.suma_minion:150 ][ERROR   ][27285] Error on getting database pillar, trying again: ('SSL connection has been closed unexpectedly\n',)

We concluded in Slack that there is no easy solution for this right now, so I added the initial waiting time from Vladimir back but also introduced a step to check for the triggered high state event in the Taskomatik log file.

nodeg avatar Oct 27 '23 09:10 nodeg

We need to disable this test for now, according to mc. It did cause too much issues in 4.3 where it already was merged. I will open a bug report and adjust the PR to comment the test out in the .yml file.

nodeg avatar Jan 18 '24 12:01 nodeg

@raulillo82 This can also be merged at some point to align the branches with 4.3, too. It is not exactly a test suite fix, though. I did rebase the branch again.

nodeg avatar Jan 18 '24 12:01 nodeg

Rebase done. Once the tests are finished, I will merge the PR.

nodeg avatar Mar 05 '24 13:03 nodeg