Pool upgrades retry indefinitely when unsuccessful

Open WadeBarnes opened this issue 5 years ago • 0 comments

When a pool upgrade fails on a particular node, the upgrade is rescheduled and restarted on that node after the timeout period specified in the pool-upgrade command that scheduled the initial upgrade. If the upgrade keeps failing, it is rescheduled and retried indefinitely until an upgrade cancel command is issued to stop the process.

This was observed during a network upgrade over the last two days. The upgrade on one node failed on Friday. That upgrade was canceled, the node was restarted, and a targeted upgrade was scheduled on that node for 2020-08-28T21:00:00-00:00 with a longer timeout. That upgrade also failed, so we'll be investigating that issue with the Steward further. However, a cancel command was not issued right away, and the upgrade process continued to retry until the cancel command was finally issued this morning.

Here is the upgrade log from the node:

"upgrade_log": [
  "2020-08-30 12:00:10.052972\tscheduled\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
  "2020-08-30 12:00:10.053393\tstarted\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
  "2020-08-30 12:30:10.278094\tscheduled\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
  "2020-08-30 12:30:10.278418\tstarted\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
  "2020-08-30 13:00:10.491240\tscheduled\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
  "2020-08-30 13:00:10.491587\tstarted\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
  "2020-08-30 13:30:10.743418\tscheduled\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
  "2020-08-30 13:30:10.743969\tstarted\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
  "2020-08-30 14:00:10.965996\tscheduled\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
  "2020-08-30 14:00:10.966318\tstarted\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n"

The cancel command was issued before the upgrade that started at 14:00:10+00:00 would have timed out. The log was captured at 14:45:00+00:00. The upgrade was not rescheduled after that; however, the log does not show that the upgrade was canceled or that it failed. On Friday, after the upgrade was canceled and the node was restarted, the upgrade log indicated that the upgrade had failed (but not before being restarted).
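
To make the retry pattern in the log easier to see, here is a minimal Python sketch that parses the tab-separated entries and measures the gap between consecutive "started" events. The field names (logged_at, event, scheduled_for, version, upgrade_id, package) are my own guesses based on the values in the entries above; they are not taken from the node's code. Only the first four entries are reproduced.

```python
from collections import Counter
from datetime import datetime

# First four entries copied from the upgrade log above.
SAMPLE_LOG = [
    "2020-08-30 12:00:10.052972\tscheduled\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
    "2020-08-30 12:00:10.053393\tstarted\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
    "2020-08-30 12:30:10.278094\tscheduled\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
    "2020-08-30 12:30:10.278418\tstarted\t2020-08-29 14:00:00+00:00\t1.1.89\t1598644251746331800\tsovrin\n",
]


def parse_entry(line):
    """Split one log entry into named fields (the field names are assumptions)."""
    logged_at, event, scheduled_for, version, upgrade_id, package = (
        line.rstrip("\n").split("\t")
    )
    return {
        "logged_at": datetime.strptime(logged_at, "%Y-%m-%d %H:%M:%S.%f"),
        "event": event,
        "scheduled_for": scheduled_for,
        "version": version,
        "upgrade_id": upgrade_id,
        "package": package,
    }


def count_events(lines):
    """Count how often each event type was logged for each upgrade id."""
    counts = Counter()
    for line in lines:
        entry = parse_entry(line)
        counts[(entry["upgrade_id"], entry["event"])] += 1
    return counts


def start_gaps(lines):
    """Return the time between consecutive 'started' entries."""
    starts = [
        entry["logged_at"]
        for entry in map(parse_entry, lines)
        if entry["event"] == "started"
    ]
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


if __name__ == "__main__":
    print(count_events(SAMPLE_LOG))
    print(start_gaps(SAMPLE_LOG))
```

Run against the full log, the count of "started" entries per upgrade id gives the number of attempts, and the gaps show the roughly 30-minute retry interval.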

The upgrade process should automatically stop after a number of failed retry attempts (say 3 by default).
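
As a rough illustration of the proposed behaviour, here is a minimal Python sketch of a retry cap. It is not the node's actual upgrade code: UpgradeRetryPolicy, run_upgrade, and DEFAULT_MAX_RETRIES are hypothetical names, and the sketch assumes "3" means three retries after the initial attempt.

```python
from dataclasses import dataclass

# Hypothetical default, taken from the "say 3" suggestion above.
DEFAULT_MAX_RETRIES = 3


@dataclass
class UpgradeRetryPolicy:
    """Tracks how many retries have been used for one scheduled upgrade."""

    max_retries: int = DEFAULT_MAX_RETRIES
    retries_used: int = 0

    def record_failure(self) -> bool:
        """Register a failed attempt; return True if a retry is still allowed."""
        if self.retries_used < self.max_retries:
            self.retries_used += 1
            return True
        return False


def run_upgrade(attempt, policy):
    """Call attempt() until it succeeds or the retry budget is spent.

    Returns a (succeeded, attempts_made) tuple.
    """
    attempts = 0
    while True:
        attempts += 1
        if attempt():
            return True, attempts
        if not policy.record_failure():
            # Budget exhausted: stop rescheduling and report the failure.
            return False, attempts


if __name__ == "__main__":
    # An upgrade that always fails stops after the initial attempt plus 3 retries.
    print(run_upgrade(lambda: False, UpgradeRetryPolicy()))
```

With a cap like this, a node whose upgrade keeps failing would give up on its own and could record a final "failed" entry, instead of retrying until someone issues a cancel command.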

The node continued to operate and perform normally during this period; it continued to participate and stayed in consensus.

Aug 30 '20 14:08 WadeBarnes