xcp icon indicating copy to clipboard operation
xcp copied to clipboard

Stale lock files prevent the updater plugin from being used by XO

Open stormi opened this issue 3 years ago • 0 comments

(creating the specification that I should have written before)

Some operations of the updater plugin fail because a stale lock file (due to a reboot during an operation for example) makes it think there's already an instance of it running.

We need to fix this in XCP-ng 8.2 as well as in the development branch.

  • For the development branch:
    • Fix to the master branch of https://github.com/xcp-ng/xcp-ng-xapi-plugins. PR: https://github.com/xcp-ng/xcp-ng-xapi-plugins/pull/30
    • Unit tests (<= probably this) or xcp-ng-tests covering the changed code, depending on what's the most appropriate. We don't have any tests yet that check what happens around this locking mechanism in the updater plugin. The tests would have to be run on the old code (some tests would failed - the bugs, others would pass) and then on the new code (all tests pass) and added to the test suite.
    • Manual tests by updating the plugins manually on XCP-ng 8.2
  • For XCP-ng 8.2
    • Backport of the fix to the maintenance branch: https://github.com/xcp-ng/xcp-ng-xapi-plugins/tree/1.6.1-8.2
    • New release of the RPM in the testing repository. The fix will be added as a patch to https://github.com/xcp-ng-rpms/xcp-ng-xapi-plugins/tree/8.2. I can handle it if necessary.
    • Manual QA tests using the built RPM (XCP-ng team) + user tests on forum
    • Run CI tests for all plugins (the tests currently just check that the plugins answer requests)
    • Publish and announce the update

Regarding the unit tests, I believe they should at least cover the variations of the following input parameters:

  • lock status: no lock file / exists and locked / exists and not locked (stale)
  • if lock file exists: is empty / contains the current operation
  • current operation: check_update / update
  • attempted operation: check_update / update
  • when there's a timeout: one case where the timeout expires / one case where the timeout is not reached

This specification can be challenged, as usual.

CC @nraynaud @Wescoeur @benjamreis

stormi avatar Nov 05 '21 16:11 stormi