lnd icon indicating copy to clipboard operation
lnd copied to clipboard

mutil: introduce simple `BlockBeat` to fix itest flakes

Open yyforyongyu opened this issue 9 months ago • 9 comments

In an attempt to fix the itest flakes found in the sweeper/multi-hop tests, I realized there's no easy way to get around the block sync issue. In fact, I think these flakes already mean block sync may cause issues outside of the test context. Thus, a minimal version of BlockBeat is implemented.

Atm, when a new block is received, it's sent to the subsystems concurrently,

flowchart
    block((block))
    
    block ---> c[ChainArb]
    block ---> ca1[ChannelArb1]
    block ---> ca2[ChannelArb2]
		block ---> us[UtxoSweeper]
		block ---> b[Bumper]
    block ---> ar[AnchorResolver]
    block ---> cr[CommitResolver]
    block ---> hr[HtlcResolver...]

One system may process the block faster than another, causing a block sync issue. Blockbeat will make sure the blocks are sent in the following order,

flowchart
    block((block))
    
    block --> c[ChainArb]
    c --> ca1[ChannelArb1]
		c --> ca2[ChannelArb2]
		c --> ca3[ChannelArb...]
    ca1 ---> ar[AnchorResolver]
    ca1 ---> cr[CommitResolver]
    ca1 ---> hr[HtlcResolver...]
    ca2 ---> rs2[Resolvers...]
    ca3 ---> rs3[Resolvers...]
		ar ---> us[UtxoSweeper]
		cr ---> us[UtxoSweeper]
	  hr ---> us[UtxoSweeper]
	  rs2 ---> us[UtxoSweeper]
	  rs3 ---> us[UtxoSweeper]
		us ---> b[Bumper]

which means the following new behavior in the sweeper:

  1. no more waiting for the next block to trigger the sweep - when an input is sent to the sweeper at height X, the sweeping won't happen until block X+1, which is no longer the case.
  2. no need to immediately sweep inputs during restart - this hack can be removed now.
  3. no more different views of block heights among these subsystems - the fee func can now accurately calculate its width.

TODOs:

  • [ ] fix unit tests
  • [ ] add unit tests
  • [ ] fix itests -> new pr to limit the scope

yyforyongyu avatar May 02 '24 06:05 yyforyongyu

[!IMPORTANT]

Review skipped

Auto reviews are limited to specific labels.

Labels to auto review (1)
  • llm-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

coderabbitai[bot] avatar May 02 '24 06:05 coderabbitai[bot]

So the basic idea is to synchronize the entire contractcourt and sweeper on each block?

Do arrows in the above diagrams represent passing around the latest block info? It seems odd to have all the resolvers separately notifying the sweeper of the latest block.

Is it guaranteed that a BlockBeat fires on startup with the latest block info? I think this is required to eliminate the immediate sweep on startup logic.

morehouse avatar May 02 '24 22:05 morehouse

So the basic idea is to synchronize the entire contractcourt and sweeper on each block?

Yes.

Do arrows in the above diagrams represent passing around the latest block info? It seems odd to have all the resolvers separately notifying the sweeper of the latest block.

It represents the timeline of the block, maybe I should try a different graph, but the block is always sent by the BlockBeat.

Is it guaranteed that a BlockBeat fires on startup with the latest block info? I think this is required to eliminate the immediate sweep on startup logic.

Yes, when a new subscription is made via RegisterBlockEpochNtfn, the current block should always be sent.

yyforyongyu avatar May 03 '24 05:05 yyforyongyu

Concept ACK.

It represents the timeline of the block, maybe I should try a different graph, but the block is always sent by the BlockBeat.

Yeah, the diagram is a bit confusing. From the code it looks like ChainArbitrator passes block info down to channel arbitrators, which pass it to resolvers. But the sweeper gets the block info directly from BlockBeat, right?

So basically BlockBeat (1) sends block info to contractcourt, then (2) sends block info to sweeper. There's a single entry point for each subsystem, each of which is responsible for distributing the block info to the rest of the subsystem. Something like this?

flowchart LR

  block([new block]) --> bb[BlockBeat]
  bb --1--> c[ChainArbitrator]
  bb --2--> us[UtxoSweeper]
  c ~~~ us
  subgraph cc[contractcourt]
    c --> ca1[ChannelArb1]
		c --> ca2[ChannelArb2]
		c --> ca3[ChannelArb...]
    ca1 ---> ar[AnchorResolver]
    ca1 ---> cr[CommitResolver]
    ca1 ---> hr[HtlcResolver...]
    ca2 ---> rs2[Resolvers...]
    ca3 ---> rs3[Resolvers...]
  end
  subgraph s[sweep]
		us ---> b[Bumper]
  end

morehouse avatar May 03 '24 16:05 morehouse

But the sweeper gets the block info directly from BlockBeat, right?

Yeah exactly.

So basically BlockBeat (1) sends block info to contractcourt, then (2) sends block info to sweeper. There's a single entry point for each subsystem, each of which is responsible for distributing the block info to the rest of the subsystem. Something like this?

Close enough! I'd like to view Bumper an independent system as it can be moved out of the sweep package. Another important new behavior is the subsystem must ACK a done signal back after processing the block - this would defintely require more attention when designing/refactoring subsystems, as it forces a designer to define what does it mean when we say the subsystem has finished processing a block. Maybe it's more clear using a sequence graph,

sequenceDiagram
		autonumber
		participant bb as BlockBeat
		participant cc as ChainArb
		participant us as UtxoSweeper
		participant tp as TxPublisher
		
		note left of bb: 0. received block x,<br>dispatching...
		
    note over bb,cc: 1. send block x to ChainArb,<br>wait for its done signal
		bb->>cc: block x
		rect rgba(165, 0, 85, 0.8)
      critical signal processed
        cc->>bb: processed block
      option Process error or timeout
        bb->>bb: error and exit
      end
    end

    note over bb,us: 2. send block x to UtxoSweeper, wait for its done signal
		bb->>us: block x
		rect rgba(165, 0, 85, 0.8)
      critical signal processed
        us->>bb: processed block
      option Process error or timeout
        bb->>bb: error and exit
      end
    end

    note over bb,tp: 3. send block x to TxPublisher, wait for its done signal
		bb->>tp: block x
		rect rgba(165, 0, 85, 0.8)
      critical signal processed
        tp->>bb: processed block
      option Process error or timeout
        bb->>bb: error and exit
      end
    end

yyforyongyu avatar May 04 '24 07:05 yyforyongyu

Cool. I think this could be used to implement https://github.com/lightningnetwork/lnd/issues/8166. We would just need to make sure that the initial block is distributed by BlockBeat before the peer manager starts.

morehouse avatar May 06 '24 16:05 morehouse

Does TxPublisher need to process the block before UtxoSweeper?

Suppose a conflicting transaction confirms at block X (e.g., https://github.com/lightningnetwork/lnd/issues/8737). If UtxoSweeper processes block X first, it won't know about any inputs that need to be rebatched due to the conflicting transaction. So publishing a rebatched transaction won't happen until block X+1 confirms.

But if TxPublisher processes block X first, it will fail the conflicted inputs, thereby allowing UtxoSweeper to publish a rebatched transaction immediately.

morehouse avatar May 07 '24 18:05 morehouse

Good question @morehouse. I need to think about it more, whether UtxoSweeper -> TxPublisher, or the other around, or in parallel. The issue with TxPublisher -> UtxoSweeper means every tx will be broadcast at X+1. Meanwhile, once this TODO is in place, I think we can eliminate creating conflicting txns completely?

reconstruct the input set from the mempool tx, which is the missing part that gives the sweeper the ability to recover its state from a restart.

yyforyongyu avatar May 08 '24 07:05 yyforyongyu

The issue with TxPublisher -> UtxoSweeper means every tx will be broadcast at X+1.

I don't think so? TxPublisher.Broadcast will still do the initial broadcast immediately.

Meanwhile, once this TODO is in place, I think we can eliminate creating conflicting txns completely?

Conflicting transactions aren't only created by us. The counterparty can also spend HTLCs, causing conflicts.

morehouse avatar May 08 '24 15:05 morehouse