bdk icon indicating copy to clipboard operation
bdk copied to clipboard

The `TxGraph::missing_heights` implementation is problematic.

Open evanlinjin opened this issue 1 year ago • 8 comments

The logic of missing_heights is explained clearly in this comment:

// Usually, if a height of a tx anchor is missing from the chain, we would want to return
// this height in the iterator. The exception is when the tx is confirmed in chain. All the
// other missing-height anchors of this tx can be skipped.

However, if a transaction is confirmed in a block that is about to turn stale (get reorged out), we still want to fetch the new block hash. I suspect this is where our "transaction is never confirmed" bug comes from.

My proposed solution:

Change the exception to:

// ... The exception is when the tx is confirmed in chain with a depth greater than the provided
// `max_reorg_depth`.

And we change the signature of the method to be:

pub fn missing_heights<'a>(
    &'a self,
    chain: &'a LocalChain,
    max_reorg_depth: u32,
) -> impl Iterator<Item = u32> + 'a { todo!() }

With some nice documentation for max_reorg_depth.

An alternative solution:

Remove the stated "exception". We can continue to consider a height "missing" when there is a floating anchor at that height (anchor that does not point to a block in the best chain).

The downside of this approach is that there are situations where certain heights will always be considered "missing". I think this is somewhat acceptable as reorgs are rare and this situation only happens due to reorgs.

Because of the behavior change, if we do this solution, we should rename the method to floating_anchor_heights.

How to reproduce with bdk_esplora:

The following should be written as a test.

  • Wallet receives money via transaction (A).
  • Transaction gets confirmed at confirmation height (h).
  • (Sync wallet with bdk_esplora),
  • A reorg happens, and (A) gets re-confirmed at height (h') where h' >= h.
  • (Sync wallet with bdk_esplora).
  • At this point onwards, transaction (A) is forever stuck as unconfirmed!

Going forward:

The test should be written first while we wait for the discussion on which solution to do matures.

evanlinjin avatar Feb 20 '24 04:02 evanlinjin

Tests written can be based on #1171

evanlinjin avatar Feb 20 '24 04:02 evanlinjin

I do not know what the root cause for the forever unconfirmed one, but I think I can make the test first, and then fix the issue after the test.

yanganto avatar Feb 21 '24 03:02 yanganto

First I'll respond inline

The logic of missing_heights is explained clearly in this comment:

// Usually, if a height of a tx anchor is missing from the chain, we would want to return
// this height in the iterator. The exception is when the tx is confirmed in chain. All the
// other missing-height anchors of this tx can be skipped.

However, if a transaction is confirmed in a block that is about to turn stale (get reorged out), we still want to fetch the new block hash.

Thanks for figuring this out. Let me restate the key observation. Imagine a chain with tip t and a transaction anchored at h and h'. If h is in the chain with t then missing heights will not return h' as "missing", however, if h is re-org'd out and the tip replaced with t' all of a sudden h' would be "missing" and returned. The problem is we are making the transition from t -> t' after we've called missing_heights and so we manage to transition to t' and but still never notice that h' is missing.

Then, because the person may be using missing_heights on the update transaction graph (which is not what I had in mind with this method but certainly a mistake you could make) they never notice that h' is missing from their main tx graph because they only ever call it on the update graph.

I suspect this is where our "transaction is never confirmed" bug comes from.

I think this is extremely unlikely to be the source of any real world observed bug. Reorgs simply don't happen often enough. it's also easy to verify this theory -- take a look at the tx anchors when the bug occurs. Does it have two?

My proposed solution:

Change the exception to:

// ... The exception is when the tx is confirmed in chain with a depth greater than the provided
// `max_reorg_depth`.

And we change the signature of the method to be:

pub fn missing_heights<'a>(
    &'a self,
    chain: &'a LocalChain,
    max_reorg_depth: u32,
) -> impl Iterator<Item = u32> + 'a { todo!() }

With some nice documentation for max_reorg_depth.

An alternative solution:

Remove the stated "exception". We can continue to consider a height "missing" when there is a floating anchor at that height (anchor that does not point to a block in the best chain).

The downside of this approach is that there are situations where certain heights will always be considered "missing". I think this is somewhat acceptable as reorgs are rare and this situation only happens due to reorgs.

Because of the behavior change, if we do this solution, we should rename the method to floating_anchor_heights.

The method does what it says it's just not a very useful method. You even made this observation in #1126 but I NACK'd it because I thought this method could have a use outside of typical syns.c I didn't think we were actually still using this method anywhere but it looks like it is used in the wallet examples somehow. These should be changed to the ChangeSet::missing_heights_from like the other example. After understanding what @evanlinjin has discovered here I think he's right in #1126 -- we should just delete this method.

BTW I think it is always the wrong idea to have methods that require the user to specify a "max_reorg_depth". Our algorithms should be able to logically tolerate any re-org depth without exception.

How to reproduce with bdk_esplora:

The following should be written as a test.

I don't think so. Is the point of the test to make sure bdk_esplora returns the right results or missing_heights returns the right heights? Neither of them are wrong, missing_heights is just being used in the wrong way (arguably there's no correct way to use it with the current blockchain APIs we have). You can't test for API bugs you need to remove them by designing a better one!

@yanganto don't waste your time writing a test here yet.

LLFourn avatar Feb 28 '24 06:02 LLFourn

@LLFourn thank you for the response. I think we can still write tests with the ChangeSet::missing_heights_from method.

evanlinjin avatar Feb 28 '24 09:02 evanlinjin

@LLFourn thank you for the response. I think we can still write tests with the ChangeSet::missing_heights_from method.

Yes. I am objecting to these tests being integration tests that actually make esplora calls. You can simulate re-orgs without running esplora.

We have a issue of understanding the implications of our design. I think it's wasteful to try and tackle those with integration tests which are hard to maintain and will quickly go stale. Gaining understanding about design is a painful process because it exposes our flaws and may cause us to lose confidence along the way. But you have to try and resist the urge to call for integration tests to address bugs in our understanding -- this can only make the situation worse. If we can't understand it, it's probably too complicated. I think #1194 is the right approach because it redesigns the process that doesn't require this call at all so the user can't even make this mistake. In theory this is possible because in eslplora/electrum you can always assume the "missing heights" is the heights of all the transactions after what we call the "point of agreement".

To demonstrate that the problem is lack of understanding here -- I think we're wonrg. I've changed my mind: there is nothing problematic about the missing_heights API. I claim that the way it's being used right now is totally fine. This claim is wrong:

However, if a transaction is confirmed in a block that is about to turn stale (get reorged out), we still want to fetch the new block hash.

We will always fetch the new block hash if missing_heights is called on the update TxGraph because it will have all the new anchors. Each tx will have at most one anchor. TxGraph::missing_heights will return the heights of the anchors is doesn't have which will be all the new ones. The problem we are imagining here is actually only applicable to when you call missing_heights on the existing TxGraph after you've updated it. But this will only be a problem for the first call, it will heal itself on the next call.

I think the only work to do here is to delete TxGraph::missing_heights since it just doesn't need to exist in any workflow we're aware of and it requires too much insight to guarantee you're using it correctly (and the docs don't provide this). It's also hard to implement correctly.

LLFourn avatar Feb 29 '24 02:02 LLFourn

I'll work on getting rid of TxGraph::missing_heights.

evanlinjin avatar Mar 05 '24 08:03 evanlinjin

I think that tx_graph::ChangeSet::missing_heights_from is not the perfect API either. We are only fetching checkpoints for newly introduced anchors. However, if fetching checkpoints fail, and the newly introduced anchors are already persisted, we will never fetch the required checkpoints again.

As an example, let's say the caller is syncing with Esplora. EsploraExt::full_scan succeeds and it turns out we need a checkpoint at height 100 to connect anchor 100:A. However, the network fails so the call to EsploraExt::update_local_chain fails. Based on how our esplora_examples are written, this is okay since we return early without persisting if anything fails. However, if the caller persists after the full_scan update, they may be left with a situation where a corresponding checkpoint is never fetched.

Another problem is that tx_graph::ChangeSet::missing_heights_from does not work well with bdk_wallet since we don't actually return the changeset in wallet methods. A solution to this would be to provide a method that calls ::missing_height_from on the staged changeset. However, what if the caller commits the staged changeset before fetching missing heights? (we'll have the same problem as stated in the aforementioned example). Another solution would be to have the method output missing heights after applying update. However, if fetching relevant checkpoints fail, and we don't somehow record the missing heights, we won't have another opportunity to fetch corresponding checkpoints again.

Because relying on a returned changeset to find missing checkpoints is problematic, I think the safest solution would be to determine missing heights from the wallet's TxGraph (however not with TxGraph::missing_heights, as it was problematic as stated in this PR). Therefore I think my second proposal - TxGraph::floating_anchor_heights - which fetches all heights that contains anchors that do not point to the best chain, makes the most sense).

evanlinjin avatar Mar 05 '24 16:03 evanlinjin

After chatting with @LLFourn a couple of days ago, I am convinced by his idea to remove the need to lock the wallet multiple times. @LLFourn suggested that since we are already passing a linked list of checkpoints to the chain source, we can use that to determine missing heights instead of locking LocalChain. This will require iterating backwards to determine whether update anchors connect with our chain - which is expensive. However, we can use a skip list structure to make checkpoint traversal more efficient. I think this should be our first step and that means we can get rid of EsploraExt::update_local_chain.

evanlinjin avatar Mar 05 '24 22:03 evanlinjin