firefly icon indicating copy to clipboard operation
firefly copied to clipboard

Migration to v1.3.0 can cause missed events when contract listeners are listening from latest

Open EnriqueL8 opened this issue 1 year ago • 13 comments

As part of the v1.3.0 release, we rearchitected the way event streams from the transaction managers are configured against the namespaces in FireFly. FireFly core will now create one event stream per namespace and as such we wrote code to migrate the existing contract listeners on the previous event stream to the corresponding namespace event stream.

Contract listeners have an option of where to listen from where the choices are accepted:

  • oldest which defaults to the first block
  • a specific block number
  • latest which will be start listen from the creation of the listener.

As such when the listeners get recreated as part of the migration, the ones set to latest will now listen from a new latest and in the time where the chain has advance and FireFly is starting up some event could get lost! This can get worse with another bug where if FireFly cannot communicate with the Transaction Manager to create the listener it will give up and never listen to those events!

A way to solve this would be:

  • Retrieve the previous contract listeners created against the default eventstream
  • Find the checkpoint value - i.e the block at which this listener has caught up to!
  • When creating the new contract listener against the namespace listener set the option to listen from the checkpoint!

This was we do not lose events.

Looking at the code, this change is tricker than it sounds because the original change simply at startup creates an event stream and apply the listeners created for the namespace. There is no notion of "migration" so this code will need to introduce some code to get the existing listeners on the "global" event stream and hopefully be able to using some identifier correlate them to the new ones that need to be created.

EnriqueL8 avatar Jun 20 '24 20:06 EnriqueL8

Concrete proposal - which has other benefits:

  • [x] For FFTM based connectors (EVMConnect) we do the work to ensure FireFly Core and query the latest checkpoint for a listener
  • [ ] We update the migration code to ask the Go blockchain plugin if it can get the checkpoint for any migration of listeners, and do one of these - in order:
    • ~Use the checkpoint if it can be obtained (EVMConnect only)~
      • see below - unclear this is optimization is feasible, and does not add anything over the next item
    • [ ] Look up the most recent event dispatched by the connector, and if one is found then require the blockchain plugin to decode the protocolId back into a fromBlock (all connectors - EthConnect/Fabric)
    • [ ] If no event, then fall back to the original approach

peterbroadhurst avatar Jun 21 '24 13:06 peterbroadhurst

The listener endpoint on FFTM already has the checkpoint information embedded, so it doesn't look like changes would be needed to the FFTM or EVMConnect codebases for Core to be able to query the checkpoint information:

https://github.com/hyperledger/firefly-transaction-manager/blob/1ac00b5eb37664d73abebd8822ae8439cef3f4bc/pkg/fftm/stream_management.go#L376-L385

peterbroadhurst avatar Jun 21 '24 20:06 peterbroadhurst

This code confirms existence of custom listeners. Per the comment it's not specific to the migration scenario: https://github.com/hyperledger/firefly/blob/16aec0b55471c2950e2d95ff90972ca0455c369d/internal/contracts/manager.go#L156-L161

This calls down to ethereum specific code, which gets the subscription in the context of the e.streamID that is designated to the namespace: https://github.com/hyperledger/firefly/blob/16aec0b55471c2950e2d95ff90972ca0455c369d/internal/blockchain/ethereum/ethereum.go#L909-L913

So from this I believe that the code for migration is rather leaning on simply the recreation code that was introduced (during the 1.2->1.3 development) for a different reason - the recreation of listeners in EVMConnect if they had been deleted.

Need to dig into what happens to the old listeners that were at a global level 👁️

peterbroadhurst avatar Jun 23 '24 18:06 peterbroadhurst

Reading through https://github.com/hyperledger/firefly/pull/1388 and the code, it seems like the original event streams are just left and ignored. However, they are not referred to in any way currently. So implementing a read of their checkpoint would be complicated.

So I'm going to investigate locating/decoding/passing the last-event information.

peterbroadhurst avatar Jun 23 '24 19:06 peterbroadhurst

For MultiParty networks (where used) the FireFly batch pin contract follows the same pattern, and there's even less code change as the namespace was included in the listener name already in 1.2: https://github.com/hyperledger/firefly/blob/16aec0b55471c2950e2d95ff90972ca0455c369d/internal/blockchain/ethereum/ethereum.go#L285-L289

peterbroadhurst avatar Jun 23 '24 19:06 peterbroadhurst

The thing that distinguishes multi-party BatchPin events from all other events in the blockchainevents table, is the lack of a listener_id.

peterbroadhurst avatar Jun 23 '24 19:06 peterbroadhurst

For Tokens, there is a strong leaning to using either 0 or a specific block number in the code. When a user doesn't specify a block number, it goes to 0 in the token connector before being passed to EVMConnect. There are many cases we see specific block numbers being used to designate the contract installation block for a token. However, currently I don't see any likelihood that token connectors would have been coerced with a blockNumber: "latest".

So I am not going to work on addressing that, as it would mean a change to the /activatepool flow (which I believe gets now called on every re-connect of a namespace to a token connector, against every pool) to look up the last event and provide it as extra information to handle only the latest special case.

I will look to discuss this a little further with @awrichar

peterbroadhurst avatar Jun 23 '24 19:06 peterbroadhurst

@EnriqueL8 @awrichar - I have submitted PR https://github.com/hyperledger/firefly/pull/1534

Open conversations I'd like your view on this week:

  • This only addresses the newest/latest case, it does not address the inefficiency of replaying all transactions since block 0 (or another specific block)
    • Should we extend the change to always use this logic if the lastProtocolID block is greater-than the registration block?
  • Token connectors:
    • Question outlined in previous comment

peterbroadhurst avatar Jun 24 '24 17:06 peterbroadhurst

Thanks for all the thinking!

  • I'm happy with not addressing this for the token connectors based on the above points.
  • It would be create to solve the inefficiency as catching up from large chain as part of the upgrade can take a long time. Wouldn't it almost always be the case that the lastProtocolID block is greater-than the registration block?

EnriqueL8 avatar Jun 25 '24 11:06 EnriqueL8

So just thinking about the implications, it will simply be that the listener in EVMConnect/fabconnect will show a fromBlock that isn't the only you specified on the listener itself.

I think that's justifiable for the benefit you get - should roll up to the release notes for 1.3.1 for sure, and maybe we need a tweak to the reference section that talks about the event bus, to cover this.

Happy to take both those items on.

peterbroadhurst avatar Jun 25 '24 17:06 peterbroadhurst

I've updated the PR to handle the 0 and explicit block cases for Fabric and Ethereum

peterbroadhurst avatar Jun 27 '24 01:06 peterbroadhurst

Just added the documentation tag to make sure we document this new behaviour and add it to the release notes

EnriqueL8 avatar Jun 27 '24 10:06 EnriqueL8

So just thinking about the implications, it will simply be that the listener in EVMConnect/fabconnect will show a fromBlock that isn't the only you specified on the listener itself.

Agreed, we should just document that as part of upgrade your listeners will get recreated and as part of this it will take the new "latest" block so it does not loose events. We query the checkpoint information from the connector when retrieving a listener so users will see it straight away

EnriqueL8 avatar Jun 27 '24 10:06 EnriqueL8

This has been merged!

EnriqueL8 avatar Jul 11 '24 15:07 EnriqueL8