Improve support for full history servers
Summary
Recent issues related to the size and maintenance of SQLite databases (#5102 and #5095) highlight infrastructure challenges for users who operate full history servers.
Solution
This issue does not propose a solution, but is meant as a place for ongoing discussion of requirements and corresponding solutions, continuing the conversation that started in #5102.
Some pain points off the top of my head:
- SQLite databases growing to several TiB in size (especially `transactions.db`)
- NuDB files are already growing beyond 16 TiB (makes it hard to use on most file systems)
- Hard to bootstrap a new full history server in a trustless way (beyond "here, just take this database and use it - if you're worried, just run https://xrpl.org/docs/references/http-websocket-apis/admin-api-methods/logging-and-data-management-methods/ledger_cleaner for a few weeks/months or so, because `--import` does no validation"); no public full history dumps are available, not even ones for certain ledger ranges (a sketch of one possible spot-check follows this list)
- Impossible(?) to have instances only responsible for certain ledger ranges (e.g. run a server that only serves data for ledgers 32,570-10,000,000 and later merge results if a query comes in) to be able to distribute load - "full history" always means "full history on a single machine", not "full history available within a cluster of machines"
- Node database has to be local, very fast for random reads and can't be shared between instances afaik
- The only actually useful database backend for `node.db` is NuDB (https://github.com/cppalliance/NuDB) - written a few years ago, with no real command line tools, no clients in other languages to access its data, no community around it to speak of, and a single commit (fixing some includes) in the past 3 years. When I asked about a corner case almost a decade ago, there was not much help from the author (https://github.com/cppalliance/NuDB/issues/46).
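To make the trustless-bootstrap point a bit more concrete, here is a minimal sketch of a spot-check one could run after an `--import`: compare the ledger hashes the freshly imported server reports (via the public `ledger` method) against hashes obtained out of band from a source you trust. The local endpoint, the hash-list file and its format are assumptions for illustration only; real verification would also have to walk each ledger's state and transaction trees, which this sketch deliberately skips.

```python
#!/usr/bin/env python3
"""Spot-check an imported full-history database against trusted ledger hashes.

Sketch only: assumes a local rippled JSON-RPC endpoint (port 5005 as in the
example config) and a file of "ledger_index ledger_hash" pairs obtained out
of band - that file and its format are hypothetical, not something rippled
ships.
"""
import json
import random
import urllib.request

RIPPLED_RPC = "http://127.0.0.1:5005/"   # assumption: local JSON-RPC endpoint
TRUSTED_HASHES = "trusted_hashes.txt"    # hypothetical out-of-band hash list


def local_ledger_hash(ledger_index: int) -> str:
    """Ask the local server for a ledger's hash via the public `ledger` method."""
    payload = json.dumps({
        "method": "ledger",
        "params": [{"ledger_index": ledger_index}],
    }).encode()
    req = urllib.request.Request(
        RIPPLED_RPC, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)["result"]
    # The field's exact location varies slightly between versions; try both.
    return result.get("ledger_hash") or result["ledger"]["ledger_hash"]


def main() -> None:
    trusted = {}
    with open(TRUSTED_HASHES) as fh:
        for line in fh:
            idx, h = line.split()
            trusted[int(idx)] = h.upper()

    # Check a random sample instead of every ledger; this is a spot-check,
    # not a cryptographic verification of the full contents.
    for idx in random.sample(sorted(trusted), k=min(100, len(trusted))):
        local = local_ledger_hash(idx).upper()
        status = "OK" if local == trusted[idx] else "MISMATCH"
        print(f"ledger {idx}: {status}")


if __name__ == "__main__":
    main()
```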
@MarkusTeufelberger we are considering refreshing the RocksDB dependency to a more recent version (it will require code changes); would that help with the last point?
I'm not sure if it is usable for full history servers, I ran into issues with it years ago, switched to NuDB and never looked back tbh. In general it would help somewhat, if it were indeed a viable option, yes.
Since it is (too) easy to criticize, here are some potential solutions for the points I brought up:
- Split `transaction.db` and `ledger.db` files by ledger height (e.g. every million or 10 million ledgers?) into their own file (see the first sketch after this list). SQLite can handle really big database files, but this current state seems like it is getting into edge-case territory already.
- `node.db` files similarly should be split by ledger height (this is what shards did). There will be some duplication of rarely changing inner and leaf nodes across these files, but that's a rather trivial overhead imho.
- Shards were also designed partially to solve this issue. I'm not 100% sure if they ended up being created deterministically, but I know that they at least can be created that way. This would enable stable P2P distribution (e.g. BitTorrent, IPFS) and out-of-band verification of such database files (technically a ledger hash from a trusted source is enough to verify the full contents of a shard cryptographically).
- This one seems more difficult, but maybe with standalone mode this might already be possible? I didn't look too much into horizontal scaling of `rippled` clusters (probably connected through some middleware that does the actual request parsing and routing, like for xrplcluster). Might be more of a documentation and middleware issue and a case of "well, you just need to do it this way".
- I would maaaaybe propose an option of querying a (local) caching server for K/V pairs, such as memcached, before actually hitting any database, because it might be easier to share something readonly/read-through between several running instances that way (see the second sketch after this list). I/O-wise, however, a lot of operations just seem to require huge amounts, and I'm not sure if even a local network (even a virtual one in the case of several instances running on the same machine) would help in that regard. Full history servers especially are more likely to get requests for stuff that is no longer in the "hot zone", so to speak. Another option might be to try out a fully networked K/V store for node data. Then again: is network I/O really fast enough if disk I/O already struggles? OTOH databases are generally filled with magic pixie dust, so it might at least be worth a try to store transactions, ledgers, wallet/config and nodes in Postgres and run some benchmarks.
- Similar to above, no idea if RocksDB is even as performant as this very bare-bones implementation of an algorithm, but at least there's a bit of an ecosystem around it. OTOH it would likely make it much harder to deterministically create ledger dump files to be distributed.
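As a rough illustration of the "split `transaction.db` by ledger height" idea, the first sketch below routes a lookup to the per-range SQLite file that covers a given ledger. The directory layout, the `transactions-<bucket>.db` naming scheme and the 1-million-ledger window are assumptions made up for this example; the `Transactions` table and `LedgerSeq` column mirror the current schema, but a split layout could of course diverge from it.

```python
#!/usr/bin/env python3
"""Route a transaction lookup to the per-range SQLite file that covers it.

Sketch of the "split transaction.db by ledger height" idea above; the
directory layout, file naming scheme and split size are assumptions.
"""
import sqlite3
from pathlib import Path

LEDGERS_PER_FILE = 1_000_000                 # assumed split size
DB_DIR = Path("/var/lib/rippled/db/split")   # assumed directory layout


def db_for_ledger(ledger_seq: int) -> Path:
    """Map a ledger sequence number to the database file covering its range."""
    bucket = ledger_seq // LEDGERS_PER_FILE
    return DB_DIR / f"transactions-{bucket:05d}.db"


def transactions_in_ledger(ledger_seq: int) -> list[str]:
    """Return the transaction IDs stored for a single ledger."""
    path = db_for_ledger(ledger_seq)
    if not path.exists():
        raise FileNotFoundError(f"no database file covers ledger {ledger_seq}: {path}")
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT TransID FROM Transactions WHERE LedgerSeq = ?",
            (ledger_seq,),
        ).fetchall()
    finally:
        conn.close()
    return [tx_id for (tx_id,) in rows]


if __name__ == "__main__":
    # 32570 is the earliest ledger retained by the network.
    print(transactions_in_ledger(32_570))
```

A query spanning several ranges would open each affected file (or ATTACH them) and merge the results, which is exactly the "merge results if a query comes in" step mentioned above.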
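And for the read-through caching idea, the second sketch shows the pattern itself: ask memcached first, fall back to the real backend, then populate the cache. `fetch_from_nodestore` is a hypothetical placeholder for whatever actually serves node objects (NuDB, RocksDB, or a networked K/V store). Since node objects are content-addressed and immutable, the cache never needs invalidation, which is what makes sharing it between several read-only instances attractive.

```python
#!/usr/bin/env python3
"""Read-through cache in front of the node store, as floated above.

Sketch only: fetch_from_nodestore stands in for the real backend lookup;
the get-then-populate pattern is the point, not the backend itself.
"""
from pymemcache.client.base import Client  # pip install pymemcache

CACHE = Client(("127.0.0.1", 11211))       # assumed local memcached instance


def fetch_from_nodestore(key: bytes) -> bytes:
    """Hypothetical slow path: look the object up in the real backend."""
    raise NotImplementedError("replace with a real backend lookup")


def get_node_object(key: bytes) -> bytes:
    """Return a node object, consulting memcached before the backend."""
    cache_key = key.hex()                   # 256-bit node hash -> printable key
    cached = CACHE.get(cache_key)
    if cached is not None:
        return cached
    value = fetch_from_nodestore(key)
    # Node objects are content-addressed and immutable, so the cached copy
    # never needs to be invalidated - convenient for shared read-only use.
    CACHE.set(cache_key, value)
    return value
```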
Coming in cold after a few years away from rippled development ...
- #5066 has some discussions on the challenges in supporting shards. Referencing here as we consider any additional strategies.
- I do recall the concerns on deterministic shards and agree that should be achievable and important to enable out-of-band verification and benefit from those other distribution methods.
- I also understood that Clio was in part meant to solve the storage/sharding challenges that FH servers face, in addition to better performance for reporting use cases. It also appears to have mixed adoption. Due to the cold start challenge of getting data? Due to the need to operate additional data backends? Understanding those challenges/gaps would be helpful too.
I still see History Sharding from this link. Is it the same thing removed in https://github.com/XRPLF/rippled/pull/5066 🤔 ?
Clio's full history node uses less than 7 TB of data storage and maintains more off-chain indexes. But FH Clio needs another full history rippled and months of time to set up 🥲.
> I still see History Sharding from this link. Is it the same thing removed in #5066 🤔?
Yes. It wasn't working very well and added a fair amount of complexity, so we decided to remove it.