firefly
All events fail to process on a node if the source event's plugin configuration is missing
Steps to reproduce:
- Create FF node1 configured with the erc1155 tokens plugin
- Create a token pool / mint some tokens on FF node1
- Create and add FF node2 to the network configured without the erc1155 tokens plugin
- FF node2 fails to process the token events and gets stuck in a rollback-and-retry loop covering all event processing. This rollback appears to prevent other non-token events from being processed, which in turn prevents the org from registering itself in the network
Expected behavior: I would expect events received from a plugin that is not configured on a node to be ignored / handled appropriately until that plugin is configured on that node
logs snippet from FF node2:
[] INFO Node not yet registered pid=284
[] INFO Confirming system broadcast 'ff_define_node' [3760c6f0-da92-4610-8da6-31efb3edbf7f] dbtx=urGDLcz7 pid=284 role=aggregator
[] INFO ==> PUT http://localhost:4000/api/v1/peers/Kaleido-zzg7nwrkr3 breq=i2CSTj7B dx=https pid=284
[] INFO <== PUT http://localhost:4000/api/v1/peers/Kaleido-zzg7nwrkr3 [200] (104.76ms) breq=i2CSTj7B dx=https pid=284
[] INFO Emitting message_confirmed for message ff_system:3760c6f0-da92-4610-8da6-31efb3edbf7f dbtx=urGDLcz7 pid=284 role=aggregator
[] INFO Confirming system broadcast 'ff_define_pool' [34583289-1072-4e65-ba77-b9c9726a3a82] dbtx=urGDLcz7 pid=284 role=aggregator
[] ERROR Failed to activate token pool 'd4b913a5-5482-41e2-9100-493abbd724f1': FF10272: Unknown tokens plugin 'erc1155' dbtx=urGDLcz7 pid=284 role=aggregator
[] WARN SQL! transaction rollback dbtx=urGDLcz7 pid=284 role=aggregator
[] ERROR process events attempt 1: FF10272: Unknown tokens plugin 'erc1155' pid=284 role=ep[ff_system:ff_aggregator]
[] INFO Confirming system broadcast 'ff_define_node' [3760c6f0-da92-4610-8da6-31efb3edbf7f] dbtx=BxMYSRuO pid=284 role=aggregator
[] INFO ==> PUT http://localhost:4000/api/v1/peers/Kaleido-zzg7nwrkr3 breq=Y0MYmZ0_ dx=https pid=284
[] INFO <== PUT http://localhost:4000/api/v1/peers/Kaleido-zzg7nwrkr3 [200] (73.50ms) breq=Y0MYmZ0_ dx=https pid=284
[] INFO Emitting message_confirmed for message ff_system:3760c6f0-da92-4610-8da6-31efb3edbf7f dbtx=BxMYSRuO pid=284 role=aggregator
[] INFO Confirming system broadcast 'ff_define_pool' [34583289-1072-4e65-ba77-b9c9726a3a82] dbtx=BxMYSRuO pid=284 role=aggregator
[] ERROR Failed to activate token pool 'd4b913a5-5482-41e2-9100-493abbd724f1': FF10272: Unknown tokens plugin 'erc1155' dbtx=BxMYSRuO pid=284 role=aggregator
[] WARN SQL! transaction rollback dbtx=BxMYSRuO pid=284 role=aggregator
[] ERROR process events attempt 2: FF10272: Unknown tokens plugin 'erc1155' pid=284 role=ep[ff_system:ff_aggregator]
[] INFO Node not yet registered pid=284
.... (reattempt sequence repeats)
[] INFO <-- POST /api/v1/network/organizations/self [408] (120003.10ms): FF10260: The request with id '...' timed out after 119,298.88ms httpreq=lsvN1NrO pid=146 req=ho3P3xSP
@jebonfig - I've assigned this over to @awrichar for some deeper thinking, but this is a hard one to solve.
It is the model of FireFly that everybody processes broadcast events in the same order, in order to build the same shared state. So ignoring an event is a significant thing to do.
However, we do also have the concept of topics that designate separate streams of messages that must be blocked until their messages are complete. In this case, I think it would be valid to consider the lack of a suitable plugin as a reason to consider a message such as this incomplete, and rewind to it when the plugin configuration is available. Then it would be just that one topic that is blocked - rather than all broadcasts.
The problem is the complexity of detecting the addition of new token config as an event, and working out how to rewind to the blocked message (I'm not sure there is any indexed field available to detect the situation). I'll leave @awrichar to consider the possibility, and cost vs. benefit of this.
It is (or should be) an explicit requirement for all nodes to have the exact same token config. Operating under any other state is considered a malformed configuration with potentially undefined behavior. But we could give some more thought over how to gracefully handle this error scenario...
A few notes for the record:
- Token pool definitions happen on the same topic as datatype definitions, of the form ff_ns_{namespace}.
- The error in question occurs when a token pool definition is received. The token pool is written to the database in "pending" state, then the manager attempts to locate the proper plugin and activate the pool. When no plugin can be found, the handler errors out and rolls back the database transaction.
- This failure is flagged as SystemBroadcastAction.ActionRetry, so it will retry processing indefinitely and will not process any further events.
- I assume (need to confirm) that adding the proper config and restarting the node would allow things to recover.
To flesh out Peter's suggestion above, I think these would be the needed steps:
- Move token pool definitions to a dedicated topic - I'd suggest one per plugin name, i.e. ff_token_{plugin}.
- Consider this a non-fatal error in the event handler - return SystemBroadcastAction.ActionWait and probably allow the token pool to be written to the database in "pending" state despite this (or consider a distinct state other than "pending" here).
- Find a way to re-process blocked token pool requests when the config changes. This implies one of two options:
  a. Cache all config and track changes between starts.
  b. On every start, scan for token pools in "pending" state, then check if their plugin name is found in the current config. Assume this may be a newly added config, so correlate from the token pool back to the message batch, and tell the aggregator to rewind and re-process that batch.