
[RFC] Log-based database for hot data

zah opened this issue 2 years ago • 6 comments

After the migration to ERA files for cold data is completed, the hot (non-finalized) set of blocks and states is expected to take up significantly less space under normal operation. Even in the presence of long stretches of non-finalization, it's difficult to imagine that the hot database will need millions of records, so we can safely assume that it would be practical to maintain a full index of the available information in memory.

With the above realization in mind, it's possible for us to implement a very simple log-based database offering extremely efficient performance for both writing and reading. It would function similarly to a write-ahead log in a typical database. Writes are atomically added to an append-only log file. The entries in the log file contain offsets into data files where the actual data is stored. When the application is restarted, it scans the log file to reconstruct an in-memory index of the locations of all existing records. Records can be loaded by memory-mapping the respective data files to take maximum advantage of the file system cache of the OS. Pruning is achieved by organizing the data files by epoch, so when a certain epoch is finalized we can just remove a set of files from the hot database. The log file can be split into segments as well, so that old segments can be removed (please note that some file systems support sparse files, which is another way to prune the log file, but this would be less portable).
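To make the shape of this concrete, here is a rough sketch of what such a database could look like; all names and layouts below are illustrative assumptions, not actual nimbus-eth2 code:

```nim
import std/[os, streams, tables]

type
  Root = array[32, byte]          # e.g. a block root, used as the record key

  LogEntry = object
    key: Root
    epoch: uint64                 # selects the per-epoch data file (for pruning)
    offset: uint64                # position of the payload inside that file
    len: uint32

  HotDb = object
    dir: string
    index: Table[Root, LogEntry]  # the full index, kept only in memory

proc dataFile(db: HotDb, epoch: uint64): string =
  db.dir / "epoch_" & $epoch & ".dat"

proc logFile(db: HotDb): string =
  db.dir / "hot.log"

proc openHotDb(dir: string): HotDb =
  ## On restart, rebuild the in-memory index by scanning the append-only log.
  result = HotDb(dir: dir, index: initTable[Root, LogEntry]())
  createDir(dir)
  if fileExists(result.logFile):
    let log = newFileStream(result.logFile, fmRead)
    defer: log.close()
    while not log.atEnd:
      var e: LogEntry
      log.read(e)                 # fixed-size binary log entries
      result.index[e.key] = e

proc put(db: var HotDb, key: Root, epoch: uint64, data: openArray[byte]) =
  ## Two sequential appends per write - O(1) disk operations. A real
  ## implementation would keep the file handles open between calls.
  let path = db.dataFile(epoch)
  let offset = if fileExists(path): uint64(getFileSize(path)) else: 0'u64
  let df = newFileStream(path, fmAppend)
  df.writeData(unsafeAddr data[0], data.len)  # assumes a non-empty payload
  df.close()
  let e = LogEntry(key: key, epoch: epoch, offset: offset, len: uint32(data.len))
  let log = newFileStream(db.logFile, fmAppend)
  log.write(e)                    # the atomically appended log entry
  log.close()
  db.index[key] = e

proc get(db: HotDb, key: Root, data: var seq[byte]): bool =
  ## One in-memory lookup plus one positioned read; memory-mapping the data
  ## file instead would lean on the OS file system cache.
  if key notin db.index: return false
  let e = db.index[key]
  let df = newFileStream(db.dataFile(e.epoch), fmRead)
  defer: df.close()
  df.setPosition(int(e.offset))
  data.setLen(int(e.len))
  discard df.readData(addr data[0], int(e.len))
  true

proc pruneFinalized(db: var HotDb, epoch: uint64) =
  ## Once an epoch is finalized, its whole data file becomes disposable;
  ## stale index entries for it would be dropped here as well.
  removeFile(db.dataFile(epoch))
```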

Expected wins:

  • Improved latency for both writing data to and reading it from the database (due to O(1) disk operations)
  • Reduced database size (due to constructing the index only in memory)
  • Reduced code size (with the potential removal of SQLite as a dependency)

An existing, widely used log-based database that we can study is Redis in its persistence mode (the append-only file, AOF).

Please note that it's also possible to implement the "spirit" of this design by using tables in SQLite that are indexed only according to the pruning criteria.

zah commented Feb 19 '22 20:02

One interesting aspect is that loading a block from sqlite (by root) is about as fast as loading it from an open era file by slot: both hovered around the 50us mark on my ssd when I eyeballed it in #3394. This would probably take some real benchmarking, but it might be that the actual wins here are .. small. Opening an era file and then reading the block is at least 20-50x slower (in the milliseconds) due to filesystem overhead (on ext4 at least).

Another way to put this: sqlite is pretty efficient, it turns out, and not opening files is a good idea in general. If life continued as it were, we could probably even keep "recent" blocks in memory and it would be fine, but with the merge coming up the block size equation will change completely, and we might find ourselves in a position where we need to be more careful with them; in particular, it might be that we must start storing headers separately from bodies.

Another point: sqlite supports mmap but we don't enable it at the moment. Playing with address spaces feels kind of risky; some benchmarks would be welcome before entertaining ideas like this.
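For reference, sqlite's mmap support is controlled by the mmap_size pragma; a minimal sketch of how a benchmark could switch it on (the wrapper module and database file name here are illustrative, not what nimbus-eth2 actually uses):

```nim
import std/db_sqlite  # moved to db_connector/db_sqlite in Nim 2.0

let db = open("hot.sqlite3", "", "", "")  # illustrative file name
# A non-zero mmap_size lets SQLite read the database through a memory
# mapping instead of read() calls, up to the given number of bytes:
db.exec(sql"PRAGMA mmap_size = 268435456")  # 256 MiB
```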

arnetheduck commented Feb 19 '22 22:02

Oh, and one more thing: I've thought about the notion of creating era files by appending finalized blocks one by one to an append-only file in the "interim" period while the full era is incomplete. These can then be turned into full era files at the right moment by appending a state; likely, this would be the most efficient way to create them to begin with.
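Roughly like this, assuming the e2store framing that era files use (an 8-byte header of 2-byte type, little-endian uint32 length and 2 reserved bytes, followed by the payload); the type constants are from the e2store spec as I recall it, and the rest is made up for illustration:

```nim
import std/streams

const
  # e2store record types (treat these constants as assumptions here)
  CompressedSignedBeaconBlock = [byte 0x01, 0x00]
  CompressedBeaconState = [byte 0x02, 0x00]

proc appendRecord(s: Stream, typ: array[2, byte], data: openArray[byte]) =
  ## Append one type-length-value record: 8-byte header, then the payload.
  var header: array[8, byte]
  header[0] = typ[0]
  header[1] = typ[1]
  let len = uint32(data.len)
  header[2] = byte(len and 0xff)           # length, little-endian
  header[3] = byte((len shr 8) and 0xff)
  header[4] = byte((len shr 16) and 0xff)
  header[5] = byte((len shr 24) and 0xff)
  # header[6..7] are reserved and stay zero
  s.writeData(addr header[0], header.len)
  if data.len > 0:
    s.writeData(unsafeAddr data[0], data.len)

proc appendFinalizedBlock(path: string, compressedSsz: openArray[byte]) =
  ## Called once per block as finalization advances through the era.
  let s = newFileStream(path, fmAppend)
  defer: s.close()
  s.appendRecord(CompressedSignedBeaconBlock, compressedSsz)

proc completeEra(path: string, compressedState: openArray[byte]) =
  ## Appending the era boundary state (plus, in reality, the slot indices)
  ## turns the interim file into a full era file.
  let s = newFileStream(path, fmAppend)
  defer: s.close()
  s.appendRecord(CompressedBeaconState, compressedState)
```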

arnetheduck commented Feb 19 '22 22:02

If opening files is slow, isn't the simple answer "don't open files"? The data files described here are also append-only and they can be pretty large, so you can keep them open, ready for seeking/mapping. Since SQLite uses a write-ahead log and a B-tree, it clearly needs to perform more write operations than the append-only updates described here. The similarity you are citing is in reading speed, which is easier to imagine considering that the top layers of the B-tree are likely to be cached in memory.

zah commented Feb 19 '22 23:02

> If opening files is slow

sure, it's just one more complication in general to be aware of, adding to the implementation cost.

> more write operations than the append-only updates described here.

well, sort of - but when the chain starts forking, this scheme slows down exactly when it's needed the most, because you need to start reorganizing data. It's important to keep forking in mind as a primary optimization target, so that we don't suffer a self-reinforcing performance degradation spiral in this case.

Shuffling data around is actually one of the stronger arguments against era files: we have to move data yet another time, which is .. unfortunate.

The point about sqlite is more to say that even with 3mil blocks in the database, the b-tree overhead is negligible (maybe it's the cache indeed) - this is why benchmarking would help give an idea of what benefits can actually be had at the margin.

arnetheduck commented Feb 20 '22 07:02

Perhaps we are imagining the described implementation in different ways, because I don't see how forking increases the cost. With forks or without, you always append to a specific data file; there is no re-organization required. When you restart the client, it scans the log files and rebuilds the forked DAG in memory. I can also offer some handwavy arguments that, due to the slashing rules, there is a maximum rate at which blocks can be produced in the network, and it can increase only at a rather slow rate.
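Sketched out (made-up names again, and assuming the log entry also carries the block's parent root), the restart scan treats blocks from all forks uniformly; each one is a single table insertion, and the fork structure falls out of the parent links:

```nim
import std/[sets, tables]

type
  Root = array[32, byte]
  DagNode = object
    parent: Root
    epoch: uint64

proc onLogEntry(dag: var Table[Root, DagNode],
                key, parent: Root, epoch: uint64) =
  ## Called for every entry during the restart scan - a block on a fork
  ## costs exactly the same as a block on the canonical chain.
  dag[key] = DagNode(parent: parent, epoch: epoch)

proc heads(dag: Table[Root, DagNode]): seq[Root] =
  ## Fork heads are simply the nodes that no other node names as parent.
  var isParent = initHashSet[Root]()
  for node in dag.values:
    isParent.incl node.parent
  for root in dag.keys:
    if root notin isParent:
      result.add root
```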

zah commented Feb 20 '22 11:02

ah ok. yeah, that could work I guess, if you get the file management right..

arnetheduck commented Feb 20 '22 19:02