fabric Evicting an orderer while it is down, adding it again, and restarting it, w/o system channel, creates endless loop

The problem:

1. A 3-node cluster is started, say o1, o2, o3 (Raft IDs 1, 2, 3).
2. Orderer o1 is shut down.
3. Orderer o1 is evicted, that is, removed from the consenter set, by a config tx submitted to o2, o3.
4. Orderer o1 is re-added to the consenter set, with the same certificate and the same endpoint.
5. Repeat steps 3 and 4 (remove, add).
6. Orderer o1 is restarted.

Orderer o1 is now known to the cluster (o2, o3) as having Raft ID = 4. The consenter on o1 detects that it has a stale Raft ID (it has 1, while everyone else already knows it as 4) and will "cowardly halt", which triggers the follower chain. This happens because the other orderers send Raft messages to the same o1 endpoint, but with target Raft ID = 4.

The follower chain of o1 then detects, using its last block (genesis), that it is still a consenter, and transfers control back to the consenter, which halts again for the same reason. This creates a never-ending loop.
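The loop described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the types `role` and `node` and the functions `step` and `simulate` are hypothetical names invented for illustration and are not Fabric's actual types or code. It captures just the two decisions from the description: the consenter halts when its local Raft ID differs from the ID the cluster uses, and the follower hands control back when the last local block still lists the node as a consenter.

```go
package main

import "fmt"

// role is a hypothetical label for which chain currently runs the channel on o1.
type role string

const (
	consenter role = "consenter"
	follower  role = "follower"
)

// node is a simplified, hypothetical model of o1 after it restarts.
type node struct {
	localRaftID   uint64 // Raft ID that o1 still holds locally (1 in the issue)
	clusterRaftID uint64 // Raft ID the cluster now targets at o1's endpoint (4 in the issue)
	inLastBlock   bool   // whether o1's last local block lists it as a consenter
}

// step performs one transition and returns the role that runs next.
func (n node) step(r role) role {
	if r == consenter {
		if n.localRaftID != n.clusterRaftID {
			// Stale Raft ID: the consenter halts and the follower takes over.
			return follower
		}
		return consenter
	}
	// Follower: the last local block says o1 is a consenter, so control goes back.
	if n.inLastBlock {
		return consenter
	}
	return follower
}

// simulate starts as a consenter and records the roles visited over maxSteps transitions.
func simulate(n node, maxSteps int) []role {
	r := consenter
	trace := []role{r}
	for i := 0; i < maxSteps; i++ {
		r = n.step(r)
		trace = append(trace, r)
	}
	return trace
}

func main() {
	o1 := node{localRaftID: 1, clusterRaftID: 4, inLastBlock: true}
	// The trace alternates between consenter and follower and never settles.
	fmt.Println(simulate(o1, 6))
}
```

In this model nothing changes between iterations: neither the local Raft ID nor the last local block is updated, so the alternation has no exit condition.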

See integration test integration/raft/config_test.go at use cases "an orderer node is evicted", "an evicted node is added back while it's offline"

Workaround: Currently, the only way to stop this behavior is to remove the channel from o1 using the channel participation API, and re-join it to the channel.
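As a rough illustration of the workaround, the two channel participation calls (remove, then join) can be built as plain HTTP requests. This is a sketch, not a tested client: the admin address `https://o1.example.com:9443`, the channel name `mychannel`, and the helper names `removeRequest` and `joinRequest` are assumptions made for the example. The paths and the `config-block` form field follow the channel participation REST API as documented for Fabric; the requests are only constructed here, not sent, since a real orderer admin endpoint typically also requires mutual TLS.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// removeRequest builds the request that removes a channel from an orderer.
// adminURL and channelID are placeholders supplied by the caller.
func removeRequest(adminURL, channelID string) (*http.Request, error) {
	return http.NewRequest(http.MethodDelete, adminURL+"/participation/v1/channels/"+channelID, nil)
}

// joinRequest builds the request that joins an orderer to a channel,
// uploading a config block as the multipart form field "config-block".
func joinRequest(adminURL string, configBlock []byte) (*http.Request, error) {
	body := &bytes.Buffer{}
	w := multipart.NewWriter(body)
	part, err := w.CreateFormFile("config-block", "config_block.pb")
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(configBlock); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, adminURL+"/participation/v1/channels", body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	const adminURL = "https://o1.example.com:9443" // hypothetical admin endpoint of o1

	rm, err := removeRequest(adminURL, "mychannel") // hypothetical channel name
	if err != nil {
		panic(err)
	}
	// Placeholder bytes; a real call would upload a recent config block of the channel.
	jn, err := joinRequest(adminURL, []byte("placeholder"))
	if err != nil {
		panic(err)
	}
	fmt.Println(rm.Method, rm.URL)
	fmt.Println(jn.Method, jn.URL)
}
```

For the re-join, a config block fetched after the last re-add of o1 is the natural choice, so that o1 starts from a configuration in which it is a consenter.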

Related problems:

Another problem is that the channel participation "Remove" does not clear the old etcdraft WAL and snapshots that were used by Raft ID=1, and restarting etcdraft with ID=4 on a WAL/snapshot created earlier by ID=1 may cause problems. This may work, but is strongly discouraged, as a new Raft node with a new ID should start fresh, without a WAL.

Solution:

Several options:

1. When the consenter detects it needs to halt because of a stale ID, it could halt without triggering a transition to a follower, and instead move to a "stopped" state; the admin can then rejoin the orderer with a more advanced config block.
2. When the consenter detects it needs to halt because of a stale ID, it could halt and signal the follower to "look ahead" for a config block beyond the current one.
3. Add a channel participation API to "stop" a channel w/o removing the ledger, and then "re-join".
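The first option listed above amounts to making the post-halt transition depend on why the consenter halted. A minimal sketch of that idea follows, assuming hypothetical names (`haltReason`, `chainState`, `afterHalt`) that do not exist in Fabric. The eviction branch is also an assumption of this model: it simply stands for every halt that keeps today's behavior of handing over to the follower.

```go
package main

import "fmt"

// haltReason is a hypothetical enumeration of why the consenter stopped.
type haltReason int

const (
	evictedFromChannel haltReason = iota // the node is no longer in the consenter set
	staleRaftID                          // the cluster addresses the node by a newer Raft ID
)

// chainState is a hypothetical label for what runs the channel after the halt.
type chainState string

const (
	followerState chainState = "follower"
	stoppedState  chainState = "stopped"
)

// afterHalt picks the state entered once the consenter halts.
// A stale Raft ID leads to "stopped", which breaks the consenter/follower loop
// and leaves it to the admin to rejoin with a more advanced config block.
func afterHalt(reason haltReason) chainState {
	if reason == staleRaftID {
		return stoppedState
	}
	return followerState
}

func main() {
	fmt.Println(afterHalt(evictedFromChannel))
	fmt.Println(afterHalt(staleRaftID))
}
```

The design point is that "stopped" is a terminal state until an admin acts, whereas the follower re-evaluates the last local block and can bounce control back.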

This is a low priority fix because there is a workaround, and it is a rare event that represents what is essentially an admin error.

Related issues: #3515

Feb 05 '23 09:02 tock-ibm

Hello, I am so glad to find this issue; this problem is closely related to our research on the security of Hyperledger Fabric. But I am sorry, I can't fully grasp what you mean. More specifically, I can't reproduce the problem as you describe it. If you could tell me the steps in detail, I would be very grateful! Thank you!

Feb 15 '23 09:02 siexpence

@siexpence The problem shows up in this test: integration/raft/config_test.go at use cases "an orderer node is evicted", "an evicted node is added back while it's offline"

This line actually tests that the orderer is in a loop:

https://github.com/hyperledger/fabric/blob/bdb311b8c70d82b387b6bac2e8feaa7f8a486f71/integration/raft/config_test.go#L1314

Removing the channel from the orderer solves the problem: https://github.com/hyperledger/fabric/blob/bdb311b8c70d82b387b6bac2e8feaa7f8a486f71/integration/raft/config_test.go#L1330

Mar 01 '23 08:03 tock-ibm
