Pool Upgrade tried to restart, 1 year later

Open lynnbendixsen opened this issue 2 years ago • 0 comments

It looks like, while in the process of upgrading to 20.04, an old "pool_upgrade" command reactivated and some nodes started writing NODE_UPGRADE txns to the config ledger with an "in process" status. See https://indyscan.indiciotech.io/txs/IND_DEMONET/config for the indy_scan output of these events.

Sequence of events that resulted in this "issue" being reported:

Sep 2, 2022 18:07:13 -> pool_upgrade command sent for the entire network. 2 hours later, 3 nodes still hadn't upgraded; not sure why. So...
Sep 2, 2022 20:26:36 -> pool_upgrade command sent for the 3 nodes that didn't upgrade.
Sep 2, 2022 20:41:03 -> pool_upgrade command completes, with the last of the three nodes reporting "complete" for the upgrade.

There was no other indication that anything had gone wrong until the first of these three nodes was started back up as a newly installed 20.04 node. That node registered that an upgrade was needed based on the commands sent a year previously (no new "pool_upgrade" txn was written to the ledger, but the node began writing txns every 15 minutes stating that a node_upgrade was "in process").

The logs show a repeated occurrence of the following sequence:

upgrader.py: found upgrade START txn
upgrader.py: Node 'My_Node' handles upgrade txn
upgrader.py: Node 'My_Node' schedules upgrade to 1.1.97
upgrader.py: ...zsRy's upgrader processing upgrade for version sovrin=1.1.97
upgrader.py: ...ezsRy's upgrader calling agent for upgrade
node.py: My_Node is about to be upgraded, sending NODE_UPGRADE in_progress to version 1.1.97
upgrader.py: Sending message to control tool: {"message_type": "upgrade", "version": "1.1.97", "pkg_name": "sovrin"}
upgrader.py: Waiting 15 minutes for upgrade to be performed
upgrader.py:
upgrader.py:
upgrader.py: Timeout exceeded for 2022-09-02
upgrader.py: Node My_Node failed upgrade 1662150396107164000 to version 1.1.97 of package sovrin scheduled on 2022-09-02 ... because of exceeded upgrade timeout

The sequence then immediately repeats (in the logs):

upgrader.py: found upgrade START txn ...
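
One way to tell which of the two 2022 commands the node is replaying is to decode the numeric upgrade ID from the "failed upgrade" log line. The sketch below assumes (this is not confirmed anywhere in the logs) that the ID is a Unix timestamp in nanoseconds; the helper name `upgrade_id_to_utc` is made up for illustration.

```python
from datetime import datetime, timezone

# Upgrade ID copied from the "failed upgrade" log line above.
UPGRADE_ID_NS = 1662150396107164000


def upgrade_id_to_utc(upgrade_id_ns: int) -> datetime:
    """Interpret the ID as nanoseconds since the Unix epoch (an assumption)."""
    # Integer arithmetic avoids float rounding on a 19-digit value.
    seconds, nanos = divmod(upgrade_id_ns, 1_000_000_000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=nanos // 1000)


print(upgrade_id_to_utc(UPGRADE_ID_NS).isoformat())
# -> 2022-09-02T20:26:36.107164+00:00
```

If that assumption holds and the times listed above are in UTC, the ID lines up with the second pool_upgrade command (Sep 2, 2022 20:26:36), the one sent to the 3 nodes that had not upgraded.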

I suggest that we research the proper "fail" command for the controller to return to the node, so that the node writes the "fail" to the ledger properly and cleans things up. Alternatively, honor the timeout by writing a fail to the ledger after the timeout expires, instead of simply trying again.
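
To make the second option concrete, here is a minimal sketch of the intended behavior. It is not the indy-node upgrader code: `UpgradeTracker`, its methods, and the in-memory `ledger` list are all hypothetical stand-ins, and only the status names ("in_progress", "complete", "fail") are taken from the report above.

```python
from dataclasses import dataclass, field
from enum import Enum


class UpgradeStatus(Enum):
    IN_PROGRESS = "in_progress"
    COMPLETE = "complete"
    FAIL = "fail"


@dataclass
class UpgradeTracker:
    """Hypothetical stand-in for the node-side upgrade bookkeeping."""
    ledger: list = field(default_factory=list)   # stand-in for the config ledger
    finished: set = field(default_factory=set)   # upgrade IDs with a terminal status

    def start(self, upgrade_id: int, version: str) -> bool:
        # Refuse to reschedule an upgrade that already ended (complete or fail).
        if upgrade_id in self.finished:
            return False
        self.ledger.append((upgrade_id, version, UpgradeStatus.IN_PROGRESS))
        return True

    def on_timeout(self, upgrade_id: int, version: str) -> None:
        # Honor the timeout: record a terminal "fail" instead of retrying.
        self.ledger.append((upgrade_id, version, UpgradeStatus.FAIL))
        self.finished.add(upgrade_id)


tracker = UpgradeTracker()
first_attempt = tracker.start(1662150396107164000, "1.1.97")    # True
tracker.on_timeout(1662150396107164000, "1.1.97")
second_attempt = tracker.start(1662150396107164000, "1.1.97")   # False
```

With this shape, a timed-out upgrade leaves exactly one "in_progress" and one "fail" entry, and the 15-minute loop seen in the logs cannot restart for the same upgrade ID.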

Nov 01 '23 18:11 lynnbendixsen