rippled icon indicating copy to clipboard operation
rippled copied to clipboard

A version of `fetchNodeNT` does not use any sync filters

Open intelliot opened this issue 2 years ago • 2 comments

Issue Description

There are two versions of fetchNodeNT, one of which doesn't use any sync filters or canoncalize through the cache. That one is called in several places because fetchNode calls it. The node could be in a sync filter and we won't know we have it.

We need to audit all code paths that end in that version of fetchNodeNT (the one that doesn't take a sync filter) to understand if it's desirable and safe to ignore the sync filter in each case.

The potential impact of this is not really clear yet. One possibility is that an operation might fail when trying to access a historical SHAMap because the node it was looking for had been recently acquired from a peer and hadn't been put in the database yet. Even though the SHAMap was set to check the nodes received from peers, that setting could be ignored by the path that doesn't use the filters.

The flipside, which could even be worse, is if a node not in the database was in the cache (say because we built it locally), it might not get into the database but get attached to the SHAMap, causing a missing node in the database. That seems less likely to actually happen, but is theoretically possible.

intelliot avatar Feb 23 '23 23:02 intelliot

note: It may be possible for fetchNodeNT (and the sync cache) to go away with the introduction of FLRv2

intelliot avatar Apr 27 '23 18:04 intelliot

TL;DR Summary

In my opinion we have a choice between three options:

  1. We decide that this problem has not bitten us yet and choose to live with the risk.
  2. We perform the audit as suggested, with a questionable outcome.
  3. We actually fix all of the necessary interfaces so a sync filer is passed in, which would actually solve the problem.

We should either choose to live with the risk, and close the issue, or actually fix the concern. The more conservative, and costly, approach is to fix the concern.

Details

First, some history. The version of fetchNodeNT() that doesn't use sync filters was introduced by David Schwartz in commit 8bda9487c6f3dd470d618a414d9469d3139f9d09 which was added in Feb 15 2015. That's close to nine years ago.

Regarding doing an audit, I think that is both

  • Impractical and
  • Impermanent.

The SHAMap public methods that use the fetchNodeNT() of concern include:

  • visitNodes used by SHAMapStoreImp.cpp, Database.cpp, Shard.cpp, SHAMapSync, and DatabaseShard_test.cpp.
  • walkMap used by Ledger.cpp, and SHAMapSync_test.cpp.
  • walkMapParallel used by Ledger.cpp.
  • upper_bound used by Ledger.cpp, LedgerHandler.cpp, View_test.cpp, and ReportingETL_test.cpp.
  • lower_bound used by LedgerHandler.cpp, View_test.cpp, and ReportingETL_test.cpp.
  • delItem used by Ledger.cpp, RCLCxTx.h, SHAMapSync_test.cpp, and SHAMap_test.cpp.
  • addGiveItem used by AmendmentTable.h, Ledger.cpp, LedgerReplayMsgHandler.cpp, FeeVoteImpl.cpp, and NegativeUNLVote.cpp.
  • updateGiveItem used by Ledger.cpp.
  • getProofPath used by LedgerReplayMsgHandler.cpp, and SHAMap_test.cpp.
  • invariants used by Ledger.cpp, SHAMapSync_test.cpp, and SHAMap_test.cpp.
  • peekItem(uint256 const&) used by Ledger.cpp, and RCLCxTx.h.
  • peekItem(uint256 const&, SHAMapHash&) used by Ledger.cpp,
  • SHAMap::const_iterator::const_iterator(SHAMap const*).
  • SHAMap::const_iterator::operator++ used by lots of files in for loops and STL algorithms.
  • SHAMap::const_iterator::operator++(int).

Impractical

In my opinion the number of places that end up invoking the concerning fetchNodeNT() is too broad for a human audit to be effective. And the audit is looking for something that is not easy to identify. How do we know whether it's safe for a particular code path to be ignorant of a node that is currently in a sync filter? This is not an easy thing for an ordinary code reviewer to recognize.

Impermanent

The rippled code base is constantly changing. So even if we do an audit today, there is no guarantee that the next code modification won't do something risky with this interface. Such risky behavior could involve calls to any of the above listed SHAMap methods.

Wadda We Do?

I see three choices.

  1. Well, things seem to have been working okay for almost 9 years now. Yeah, there are risks, but is this the riskiest place in our code? Maybe we leave this as is and hope that our tests catch anything funny?

  2. We ignore the impracticality and impermanence of the audit and do the audit anyway. Somebody might spot something funny that needs to be fixed.

  3. We add a sync filter reference to all of the above methods and pass that reference down the call chain to fetchNodeNT(). This, or something like it, is really the only lasting long-term solution to any concerns we might have in this area. This approach might require making a new "empty" sync filter that can be passed in safe circumstances.

If we choose to do something more than just close the issue, then the amount of work for the audit is probably comparable to the work to add the sync filter parameter. So, if we choose to do something, then we should just fix the problem so we don't have to worry about it in the future.

scottschurr avatar Dec 11 '23 21:12 scottschurr