firefly icon indicating copy to clipboard operation
firefly copied to clipboard

Close FIR-17

Open nguyer opened this issue 1 year ago • 3 comments
trafficstars

This issue is to track the work required to close out FIR-17. Specifically we've identified that we may need to create an API to allow someone to recover a specific topic if we've spent a nextpin on a transaction that will never succeed. This would currently block the topic forever after that point.

nguyer avatar Jan 10 '24 15:01 nguyer

The main blocker I'm aware of right now can be summarized as follows:

https://github.com/hyperledger/firefly/blob/fd542c0b1dd74dd9d7d009f62aa7f601a136d1fa/internal/multiparty/manager.go#L248-L257

This code is invoked at the point that we are in the batch dispatching loop, meaning the batch has been assembled and sealed, and has been assigned an order in its ordering context. The code is built to assume that any error returned from SubmitBatchPin is retryable. This was reasonably sane to assume when dealing with the simple BatchPin contract. However, with the potential now to route this request through a custom smart contract, we've introduced a lot more ways that this could fail in a way that really is not retryable, but will keep retrying forever anyway and block the topic.

I don't think we can fully prevent getting into this situation. We could add guardrails, but I think the more important side of the issue is to provide a way to recover from a stuck batch dispatcher (ie one that is retrying forever on a batch that will continue to fail).

Recovery probably can take two directions:

  • recover "backwards" by unwinding batch dispatch and assembly to remove this batch and reassign ordering of anything behind it in line (probably difficult)
  • recover "forwards" by marking this batch bad and somehow dispatching it (or dispatching a placeholder) that will allow everyone to fill the gap and move forward (probably easier, but could have implications if you queued up things that need to be ordered afterward)

awrichar avatar Feb 28 '24 19:02 awrichar

Note that the problem is smaller for broadcast pinned messages, which all use an identical topic. It would be enough to simply "cancel" the batch and its messages, which would unblock things.

For private pinned messages, you have the problem of the nonce. https://github.com/hyperledger/firefly/blob/c36ccb652c6a6008b3df7d07516afee831ace39b/internal/batch/batch_manager.go#L487

awrichar avatar Feb 28 '24 20:02 awrichar

I think each batch dispatcher can only have one sealed batch in its queue at a time, but I need to confirm.

https://github.com/hyperledger/firefly/blob/c36ccb652c6a6008b3df7d07516afee831ace39b/internal/batch/batch_processor.go#L364

The first thing to check is how difficult it would be to unwind/unseal a batch that is stuck in the dispatch phase. My gut reaction was "pretty difficult", but I'd like to have a specific answer to this question before proposing solutions.

awrichar avatar Feb 28 '24 20:02 awrichar

Further details on the steps needed to recover from a stuck batch in either direction:

Fail backward This means canceling a batch of pinned private messages by "unsealing" it and restoring nonces to what they were before the batch was assembled.

The main problem is that the "nonces" table in the database has been updated to set the next available nonce for each context. Restoring state would require determining all nonces that were updated by messages in this batch, and what was the earliest nonce. We would then have to update the appropriate row(s) in the "nonces" table to restore them. There's an additional wrinkle because nonce 0 is only used when the row did not exist - so if the first message on the context is canceled, that nonce row would need to be deleted. This is all pretty messy to achieve (though maybe not impossible). The one positive is that the dispatcher does block between sealing and dispatching - so the next batch in line won't start grabbing nonces as long as the current batch is stuck in dispatching.

Fail forward This means dispatching the stuck message on its own, or a placeholder message using the same nonce, instead of with the custom blockchain call.

This would probably take the form of a special flag on the batch or operation, to prevent it from taking the special path to invoke the custom contract. Instead, a normal batch pin could be performed.

We would probably want to replace the message with a placeholder message, or flag it so that all members will treat it as "rejected" or similar (not confirmed).

Other considerations In either case, note that the batch dispatcher runs a tight retry loop. Any API call to cancel a batch would need a way to signal into this loop and make it stop retrying.

Also, we might want to provide a way to clear out messages that are in the "ready" state and have not yet been assembled into a batch (ie messages that are queued behind the stuck batch). This is separate from the problem of unsticking the batch itself. These messages won't have a batch or nonce assigned to them, but they may already be queued in a Go channel for batch assembly - so locating them and preventing them from getting dispatched could be a challenge.

If we do not provide this, we risk 1) a bunch of messages getting confirmed in an order that the application does not expect, after a message in front of them failed, or 2) each of these messages getting stuck in the same way as the first one, and having to be unstuck one by one.

awrichar avatar Mar 28 '24 20:03 awrichar

Thanks for continuing to move this forward @awrichar!

On the Fail Forward case, does just skipping the custom blockchain call solve every case where we could get stuck with this? I think if I recall correctly we had discussed the fact that even before FIR-17 it may be possible to block a topic permanently for other reasons (though I can't recall which specifically at the moment). Maybe Fail Forward is still a possible valid answer. But maybe it needs to do a bit more than just skip that one specific call?

nguyer avatar Mar 29 '24 13:03 nguyer

@nguyer you're correct that the fail forward (as written above) wouldn't solve every case - it would just reduce the exposure down to reasons that the original batch pin can fail (which I believe is limited to things like having deployed a bad FireFly contract, or having not actually deployed a contract to the address in the config at all).

awrichar avatar Mar 29 '24 14:03 awrichar

Couple of points from a offline huddle with @nguyer:

  • The batch processor that assembles "custom contract" private pins is separate from the one that assembles "normal" private pins - but they draw from the same sequence of nonces. This means that we may have successfully assigned and sent later nonces than the stuck batch. This makes "fail backward" considerably more difficult unless we do a lot more rearchitecture on how batch processors are split.
  • Therefore, "fail forward" (aka "gap fill") seems like the correct approach to explore first.
  • Unsticking the dispatching batch is the priority - any further enhancements to clearing out messages still in the assembly queue is totally separate (and may not make the cut for this first fix).

awrichar avatar Mar 29 '24 18:03 awrichar

Another important detail: for private messages, the recipients will all have received the message via DX and marked it "pending". So a gap fill does seem to be the right move, in order to allow them to know the message won't get confirmed.

awrichar avatar Mar 29 '24 21:03 awrichar

Skeleton proposal:

  • New API /batches/{batch}/cancel
    • Only valid if batch tx type is contract_invoke_pin
    • Only valid if all messages in batch are in state ready (note that for this tx type, the batch will always be of size 1)
  • All messages in batch will be moved to a new state of cancelled by the sender
  • New messages will be created which are mostly copies of the previous messages (including any assigned pins), except:
    • new ID
    • CID referencing the original message
    • tx type of batch_pin
    • special tag to indicate gap fill (naming TBD)
    • no data payload
  • Batch dispatcher will (somehow) be notified to stop dispatching the cancelled batch and move on
  • New "gap fill" messages will be picked up by the other "normal" batch processor and sent out via the regular batch pin contract
  • All recipients will receive the gap fill and will mark the original message cancelled

awrichar avatar Mar 29 '24 21:03 awrichar