rippled
rippled copied to clipboard
A version of `fetchNodeNT` does not use any sync filters
Issue Description
There are two versions of fetchNodeNT, one of which doesn't use any sync filters or canoncalize through the cache. That one is called in several places because fetchNode calls it. The node could be in a sync filter and we won't know we have it.
We need to audit all code paths that end in that version of fetchNodeNT (the one that doesn't take a sync filter) to understand if it's desirable and safe to ignore the sync filter in each case.
The potential impact of this is not really clear yet. One possibility is that an operation might fail when trying to access a historical SHAMap because the node it was looking for had been recently acquired from a peer and hadn't been put in the database yet. Even though the SHAMap was set to check the nodes received from peers, that setting could be ignored by the path that doesn't use the filters.
The flipside, which could even be worse, is if a node not in the database was in the cache (say because we built it locally), it might not get into the database but get attached to the SHAMap, causing a missing node in the database. That seems less likely to actually happen, but is theoretically possible.
note: It may be possible for fetchNodeNT (and the sync cache) to go away with the introduction of FLRv2
TL;DR Summary
In my opinion we have a choice between three options:
- We decide that this problem has not bitten us yet and choose to live with the risk.
- We perform the audit as suggested, with a questionable outcome.
- We actually fix all of the necessary interfaces so a sync filer is passed in, which would actually solve the problem.
We should either choose to live with the risk, and close the issue, or actually fix the concern. The more conservative, and costly, approach is to fix the concern.
Details
First, some history. The version of fetchNodeNT() that doesn't use sync filters was introduced by David Schwartz in commit 8bda9487c6f3dd470d618a414d9469d3139f9d09 which was added in Feb 15 2015. That's close to nine years ago.
Regarding doing an audit, I think that is both
- Impractical and
- Impermanent.
The SHAMap public methods that use the fetchNodeNT() of concern include:
visitNodesused by SHAMapStoreImp.cpp, Database.cpp, Shard.cpp, SHAMapSync, and DatabaseShard_test.cpp.walkMapused by Ledger.cpp, and SHAMapSync_test.cpp.walkMapParallelused by Ledger.cpp.upper_boundused by Ledger.cpp, LedgerHandler.cpp, View_test.cpp, and ReportingETL_test.cpp.lower_boundused by LedgerHandler.cpp, View_test.cpp, and ReportingETL_test.cpp.delItemused by Ledger.cpp, RCLCxTx.h, SHAMapSync_test.cpp, and SHAMap_test.cpp.addGiveItemused by AmendmentTable.h, Ledger.cpp, LedgerReplayMsgHandler.cpp, FeeVoteImpl.cpp, and NegativeUNLVote.cpp.updateGiveItemused by Ledger.cpp.getProofPathused by LedgerReplayMsgHandler.cpp, and SHAMap_test.cpp.invariantsused by Ledger.cpp, SHAMapSync_test.cpp, and SHAMap_test.cpp.peekItem(uint256 const&)used by Ledger.cpp, and RCLCxTx.h.peekItem(uint256 const&, SHAMapHash&)used by Ledger.cpp,SHAMap::const_iterator::const_iterator(SHAMap const*).SHAMap::const_iterator::operator++used by lots of files inforloops and STL algorithms.SHAMap::const_iterator::operator++(int).
Impractical
In my opinion the number of places that end up invoking the concerning fetchNodeNT() is too broad for a human audit to be effective. And the audit is looking for something that is not easy to identify. How do we know whether it's safe for a particular code path to be ignorant of a node that is currently in a sync filter? This is not an easy thing for an ordinary code reviewer to recognize.
Impermanent
The rippled code base is constantly changing. So even if we do an audit today, there is no guarantee that the next code modification won't do something risky with this interface. Such risky behavior could involve calls to any of the above listed SHAMap methods.
Wadda We Do?
I see three choices.
-
Well, things seem to have been working okay for almost 9 years now. Yeah, there are risks, but is this the riskiest place in our code? Maybe we leave this as is and hope that our tests catch anything funny?
-
We ignore the impracticality and impermanence of the audit and do the audit anyway. Somebody might spot something funny that needs to be fixed.
-
We add a sync filter reference to all of the above methods and pass that reference down the call chain to
fetchNodeNT(). This, or something like it, is really the only lasting long-term solution to any concerns we might have in this area. This approach might require making a new "empty" sync filter that can be passed in safe circumstances.
If we choose to do something more than just close the issue, then the amount of work for the audit is probably comparable to the work to add the sync filter parameter. So, if we choose to do something, then we should just fix the problem so we don't have to worry about it in the future.