firefly-ethconnect icon indicating copy to clipboard operation
firefly-ethconnect copied to clipboard

Missing events when performing catchup with `catchupModePageSize` > `catchupModeBlockGap`

Open peterbroadhurst opened this issue 3 years ago • 4 comments

This is problematic for FireFly, which subscribes from Block 0, and uses a PATCH on startup to ensure the configuration of the event stream is always up to date.

[2022-02-04T02:57:47.779Z]  INFO --> PATCH /eventstreams/es-2259ebbd-0f87-419d-4463-29876cde5be6
[2022-02-04T02:57:47.780Z]  INFO es-2259ebbd-0f87-419d-4463-29876cde5be6: Update event stream
[2022-02-04T02:57:47.780Z]  INFO es-2259ebbd-0f87-419d-4463-29876cde5be6: Notified of an ongoing stream update, existing batch processor
[2022-02-04T02:57:47.780Z]  INFO es-2259ebbd-0f87-419d-4463-29876cde5be6: Notified of an ongoing stream update, existing event poller
[2022-02-04T02:57:47.780Z]  INFO es-2259ebbd-0f87-419d-4463-29876cde5be6: Notified of an ongoing stream update, not waiting for new events
[2022-02-04T02:57:47.800Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): Uninstalled filter. ok=true (%!s(<nil>))
[2022-02-04T02:57:47.800Z]  INFO <-- PATCH /eventstreams/es-2259ebbd-0f87-419d-4463-29876cde5be6 [200]
[2022-02-04T02:57:47.801Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): initial block height for event stream (from block): 0
[2022-02-04T02:57:47.807Z]  INFO --> GET /subscriptions
[2022-02-04T02:57:47.807Z]  INFO <-- GET /subscriptions [200]
[2022-02-04T02:57:47.819Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): catchup mode. Blocks 0 -> 4999
[2022-02-04T02:57:47.822Z]  INFO WS/8d27544e-a7ce-460d-6442-0621118b27dc: Connected
[2022-02-04T02:57:47.891Z]  INFO WS/16721966-36b1-45bc-48ff-4816badd5556: Connected
[2022-02-04T02:57:48.850Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): initial block height for event stream (from block): 0
[2022-02-04T02:57:48.872Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): catchup mode. Blocks 5000 -> 9999
[2022-02-04T02:57:49.911Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): initial block height for event stream (from block): 0

Then once it catches up, something weird happens after the first batch:

[2022-02-04T03:05:50.176Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): catchup mode. Blocks 435000 -> 439999
[2022-02-04T03:05:50.226Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): Dispatching event. Address=0x4D2752586bff01E1F598B71FA56e9AC8A11582D8 BlockNumber=436683 TxIndex=0x0
[2022-02-04T03:05:50.226Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): Dispatching event. Address=0x4D2752586bff01E1F598B71FA56e9AC8A11582D8 BlockNumber=436721 TxIndex=0x0
[2022-02-04T03:05:50.226Z]  INFO es-2259ebbd-0f87-419d-4463-29876cde5be6: New batch length 1
[2022-02-04T03:05:50.226Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): Dispatching event. Address=0x4D2752586bff01E1F598B71FA56e9AC8A11582D8 BlockNumber=436764 TxIndex=0x0
[2022-02-04T03:05:50.226Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): Dispatching event. Address=0x4D2752586bff01E1F598B71FA56e9AC8A11582D8 BlockNumber=436835 TxIndex=0x0
[2022-02-04T03:05:50.226Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): Dispatching event. Address=0x4D2752586bff01E1F598B71FA56e9AC8A11582D8 BlockNumber=436916 TxIndex=0x0
[2022-02-04T03:05:50.226Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): Dispatching event. Address=0x4D2752586bff01E1F598B71FA56e9AC8A11582D8 BlockNumber=437041 TxIndex=0x0
[2022-02-04T03:05:50.727Z]  INFO es-2259ebbd-0f87-419d-4463-29876cde5be6: Batch 4 initiated with 6 events. FirstBlock=436683 LastBlock=437041
[2022-02-04T03:05:50.727Z]  WARN Cleared out suprious ack (could be from previous disonnect). err=FFEC100205: WebSocket '8d27544e-a7ce-460d-6442-0621118b27dc' closed
[2022-02-04T03:05:51.154Z]  INFO Attempt batch 4 complete. ok=true
[2022-02-04T03:05:51.227Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): initial block height for event stream (from block): 0
[2022-02-04T03:05:51.263Z]  INFO sb-e1f61740-2dc7-4e1f-68ad-f0ac55e4e663:BatchPin(address,uint256,string,bytes32,bytes32,string,bytes32[]): created filter from block 440000: 0xdde5b88258480118d701b810b2534400 - {Addresses:[0x4D2752586bff01E1F598B71FA56e9AC8A11582D8] Topics:[[0x805721bc246bccc732581be0c0aa2dd8f7ec93e97ba4b307be84428c98b0a12f]]}

See the second " initial block height for event stream (from block): 0" and then no further events.

There was an event it should have hit at block 437521 - which is after the 437041 event it dispatched as part of the gap fill, but before the 440000 where it set the filter.

peterbroadhurst avatar Feb 04 '22 03:02 peterbroadhurst

This is affect v0.12.0 of FireFly, which shipped this change: https://github.com/hyperledger/firefly/commit/c07090310dffe024aa5ca193e51533df0988e027

peterbroadhurst avatar Feb 04 '22 03:02 peterbroadhurst

It seems to be the case that as well as resetting, the event stream does not receive events after the initial batch that are found during the rewind.

peterbroadhurst avatar Feb 04 '22 03:02 peterbroadhurst

Ok - found the root cause here.

It's subtle:

  • The catchupModePageSize was set to be GREATER THAN the catchupModeBlockGap
  • This meant our logic at the point we caught up missed some events
  • ~It also seems to have meant we didn't write a checkpoint, which is why we are resetting to 0 - so the original subject of this issue is probably erroneous~

peterbroadhurst avatar Feb 04 '22 04:02 peterbroadhurst

Note I found we did still have a problem with writing incremental HWM checkpoint while doing gap based recovery: https://github.com/hyperledger/firefly-ethconnect/pull/199#issuecomment-1029662339

peterbroadhurst avatar Feb 04 '22 05:02 peterbroadhurst