Discussion on how to fix slow rebuild from TransactionLog on startup
This hasn't really been an issue yet, but it will be in the future.
I can think of several approaches to improving the performance of this off the top of my head:
- serializing the in-memory DataStore to disk on shutdown, and rebuilding it from that instead of the transaction log. This won't necessarily be fast for massive data sets though.
- Using a memory-mapped file to store the data, although I don't know much about this.
- (Asynchronously) writing objects to disk, so that they are lazy-loaded or loaded in the background when they are accessed, and there is no need to re-run the transaction log.
- 'Cleaning' the transaction log by removing unnecessary transactions, such as edits that were later discarded. This probably wouldn't scale that well, though.
- Sharding the transaction log by key and rebuilding lazily. This would make transactional writes to disk harder...
- Rebuilding from the transaction log in parallel, sharding the work across the individual cores (rough sketch after this list). Not an order-of-magnitude improvement, though.
- If this is really important, don't go embedded, and instead run a long-running system with a slow startup time and a network API...
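Roughly what I mean by the parallel option, as a minimal sketch (in Java just for illustration - `LogRecord` and the last-write-wins apply step are made up, not the real log format):

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical, simplified log record: a key plus whatever payload the edit carries.
record LogRecord(String key, String payload) {}

public class ParallelReplay {

    // Partitions log records by key hash and replays each partition on its own
    // thread. Per-key ordering is preserved because all records for a given key
    // land in the same partition and each partition is replayed sequentially.
    public static Map<String, String> replay(List<LogRecord> log, int shards)
            throws InterruptedException {
        List<List<LogRecord>> partitions = new ArrayList<>();
        for (int i = 0; i < shards; i++) {
            partitions.add(new ArrayList<>());
        }
        for (LogRecord r : log) {
            partitions.get(Math.floorMod(r.key().hashCode(), shards)).add(r);
        }

        Map<String, String> store = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(shards);
        for (List<LogRecord> partition : partitions) {
            pool.execute(() -> {
                for (LogRecord r : partition) {
                    store.put(r.key(), r.payload()); // "apply" is just last-write-wins here
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return store;
    }
}
```

The win is bounded by the core count and by disk read speed, which is why I don't expect more than a constant-factor improvement from this alone.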
Thoughts on these approaches, and any others, are very welcome!
What's the goal of the database? As far as I can tell (please correct me if I've misunderstood the code), you have a transaction log and an in-memory representation. Are you building an in-memory database like Redis, or are you building a persisted database?
With a persisted database, transactions commit to disk into an on-disk structure like an LSM or a B-tree, so replaying the commit log is simpler in that case. For an in-memory database, snapshots are more efficient because you can compress them and reduce the biggest replay cost, which is disk I/O. Start-up is going to be a lot slower than with a persisted database, though, because you have to re-load the last known memory state and, in some cases, additionally re-play the last portion of the commit log if the last known snapshot is behind it.
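To make the snapshot-plus-tail recovery concrete, here's a rough sketch (the `Snapshot`, `LogEntry`, and `lastAppliedLsn` names are made up for illustration, not taken from your code):

```java
import java.util.*;

// Sketch of startup recovery for the snapshot + commit-log approach.
// Snapshot and LogEntry are hypothetical; lsn is a monotonically increasing
// log sequence number stamped on every committed entry.
record Snapshot(long lastAppliedLsn, Map<String, String> state) {}
record LogEntry(long lsn, String key, String value) {}

public class Recovery {
    public static Map<String, String> recover(Snapshot snapshot, List<LogEntry> commitLog) {
        Map<String, String> store = new HashMap<>(snapshot.state());
        for (LogEntry e : commitLog) {
            // Only the tail written after the snapshot needs to be replayed.
            if (e.lsn() > snapshot.lastAppliedLsn()) {
                store.put(e.key(), e.value());
            }
        }
        return store;
    }
}
```

The point is that only entries written after the snapshot's sequence number are replayed; everything older is already captured in the snapshot.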
Here is a highly recommended paper about writing reliable WALs, which a lot of databases draw inspiration from: http://202.202.43.2/users/1008/docs/6176-1.pdf
An alternative approach to database reliability for a persisted database is shadow paging. In this case you don't need to replay anything when failures happen.
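As a (deliberately coarse) illustration of the shadow idea, assuming the whole database state can be written out in one go - real shadow paging works per page, with a page table and an atomically switched root pointer:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

// File-level caricature of shadow paging: the new state is written to the side,
// forced to disk, then made current with a single atomic rename. A crash at any
// point leaves either the complete old file or the complete new one.
public class ShadowSwap {
    public static void commit(Path current, byte[] newState) throws IOException {
        Path shadow = current.resolveSibling(current.getFileName() + ".shadow");
        try (FileChannel ch = FileChannel.open(shadow,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(newState);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // make the shadow copy durable before switching to it
        }
        // On POSIX filesystems this atomically replaces the current file.
        Files.move(shadow, current, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

After a crash you simply open whichever file is current; there's no log to replay.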
Thanks, really informative. I think for now the in-memory approach suits the project better - it mainly aims to provide a free embeddable database for smaller (or sharded) applications, and persisting/loading the objects to/from an on-disk structure will be quite a lot of tricky work, particularly if I want to stay in managed code.
Given that, a zipped snapshot approach (and some parallelism when rebuilding) could go a fair way towards mitigating the startup issues. Something will have to be done though, as the current transaction log format (consecutive JSON objects) will definitely chug!
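Roughly the shape I have in mind for the zipped snapshot (Java and plain object serialization here are just stand-ins for whatever format the DataStore actually uses):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.zip.*;

// Sketch of the "zipped snapshot" idea: dump the in-memory map through gzip on
// shutdown and load it back on startup, so only the log tail written after the
// snapshot needs replaying.
public class GzipSnapshot {

    public static void save(Map<String, String> store, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(new HashMap<>(store)); // copy into a plain serializable map
        }
    }

    @SuppressWarnings("unchecked")
    public static Map<String, String> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(Files.newInputStream(file)))) {
            return (Map<String, String>) in.readObject();
        }
    }
}
```

save() would run on shutdown (or periodically), and load() on startup, after which only the log tail written since the snapshot needs replaying.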
Keep in mind that the database design will dictate which workload use cases it suits. Being an in-memory-only database means it will be suited to more cache-oriented workloads rather than general-purpose database uses. If that's the goal then that's good, but a lot of the pain with Redis, for example, comes from people misusing it as a general-purpose database. Keep the architecture design pointed in the direction of the goals of the project :)
In this case, you're more along the lines of a memcache than a persisted database.
I have to disagree somewhat - given that the data is still durable (unlike with memcached) and there is (nascent) querying (unlike Redis), it's on the way to being a general-purpose DB. Whether data is stored on disk or rebuilt from a log is more of an implementation detail from the application's point of view.
If your definition of a general-purpose DB includes handling HDD-sized (as opposed to RAM-sized) datasets, then you're correct: that's not what is being aimed for right now (although I reckon you could get pretty far by paging LRU objects out to disk).
If you look at how well-engineered persisted databases have been designed and implemented, from the '80s right through to today's NoSQL DBs (good engineering never changes), you won't find any good ones using a WAL + snapshot-only architecture. Database engineering and transaction theory don't change regardless of the data model projected at the higher levels of the database.
This is a pitfall and why Redis has failed at bolting on proper persistence multiple times now.
There's nothing wrong with having a WAL + snapshot system; it's just very important to understand which workloads the design is suited for - not system-of-record type stuff, for example. That's why I asked what the goals are, so that the engineering decisions align with the goals of the project :)
Are you saying that data loss is bound to ensue with this approach?
No, if the WAL is implemented properly you shouldn't have any data loss issues. It's not easy to get a reliable WAL written, but assuming one is, it should be good to go :)
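To give a sense of the care involved, here's a minimal sketch of what a careful append tends to look like - length-prefixed, checksummed records forced to disk before the commit is acknowledged (the exact framing here is just an illustration, not a prescription):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;
import java.util.zip.CRC32;

// Each record carries a length prefix and a CRC32 so a torn write at the tail
// can be detected and discarded on replay; the channel is forced before the
// commit is acknowledged to the caller.
public class WalAppender {
    private final FileChannel channel;

    public WalAppender(Path path) throws IOException {
        this.channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    public void append(byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload);
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + payload.length);
        buf.putInt(payload.length);
        buf.putLong(crc.getValue());
        buf.put(payload);
        buf.flip();
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        channel.force(true); // flush data and metadata before acknowledging the commit
    }
}
```

On replay, a record whose checksum doesn't match (a torn write at the tail) is discarded rather than applied.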
Saving snapshots creates a high amount of write amplification, which means you're doing a lot more I/O than you need to over the life of the database; that hurts performance and also the lifespan of SSDs. On top of that, writing a snapshot out to disk is an expensive operation: you're writing the entire dataset in one go (even if throttled) rather than amortizing the work over the life of the set. You can do some smarts and only snapshot changed data, but then you're spending a bunch of compute to figure out what changed just to save disk I/O, and that takes away from queries.
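For what "only snapshot changed data" tends to look like, a minimal dirty-set sketch (names are hypothetical) - note that this bookkeeping is exactly the extra compute I'm talking about:

```java
import java.util.*;
import java.util.concurrent.*;

// Every write marks its key dirty, and a snapshot pass drains the dirty set and
// writes just those entries. Illustrative only: the tracking itself costs CPU
// and memory on the write path.
public class DirtyTracker {
    private final Map<String, String> store = new ConcurrentHashMap<>();
    private final Set<String> dirty = ConcurrentHashMap.newKeySet();

    public void put(String key, String value) {
        store.put(key, value);
        dirty.add(key);
    }

    // Returns only the entries changed since the last call, for an incremental snapshot.
    public Map<String, String> drainChanges() {
        Map<String, String> delta = new HashMap<>();
        for (Iterator<String> it = dirty.iterator(); it.hasNext(); ) {
            String key = it.next();
            it.remove();
            String value = store.get(key);
            if (value != null) {
                delta.put(key, value);
            }
        }
        return delta;
    }
}
```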
Instagram found out that in-memory databases are the most costly kind of database to scale. The bad trade-off that in-memory databases make is that your least-used data costs as much to operate as your most-used data. In some situations that makes sense, but not in most. Instagram moved away from an in-memory + snapshot database to a normal persistent database and cut their costs by 75%. If your machine on EC2 or Azure has 32GB of RAM and the database grows to 64GB, you have to purchase more CPU and disk along with the extra RAM even if your compute needs never changed. Your costs just doubled, and it's your least-used data driving them up.
This is why I asked at the very beginning what the goals of the project are: database engineering choices dictate what purpose the database is suited for and the types of workloads that make sense.