Add new compaction abstraction, simulator, and implementation.
This consists of three parts:
- A refactoring and new contract for implementing and testing compaction.
The logic is now in a separate crate, with no dependency on the 'pageserver' crate. It defines an interface that the real pageserver must implement in order to call the compaction algorithm. The interface models things like delta and image layers, but only the parts that the compaction algorithm needs to make decisions. That makes it easier to unit test the algorithm and experiment with different implementations.
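To make the contract concrete, here is a minimal Rust sketch of the kind of interface described above. Only the CompactionJobExecutor name comes from this PR; the other types, methods, and signatures are illustrative assumptions, not the crate's actual API:

```rust
// Illustrative sketch only: apart from CompactionJobExecutor, the names and
// signatures here are assumptions, not the actual API of the new crate.

/// Half-open key range: the only key information the algorithm sees.
#[derive(Clone, Copy)]
pub struct KeyRange { pub start: u64, pub end: u64 }

/// Half-open LSN range.
#[derive(Clone, Copy)]
pub struct LsnRange { pub start: u64, pub end: u64 }

/// Abstract view of a delta layer: changes over a key range and an LSN range.
pub trait DeltaLayer {
    fn key_range(&self) -> KeyRange;
    fn lsn_range(&self) -> LsnRange;
    fn file_size(&self) -> u64;
}

/// Abstract view of an image layer: a materialized snapshot at a single LSN.
pub trait ImageLayer {
    fn key_range(&self) -> KeyRange;
    fn lsn(&self) -> u64;
}

/// What the real pageserver (and the simulator) must implement, so that the
/// compaction algorithm can inspect existing layers and request new ones.
pub trait CompactionJobExecutor {
    type Delta: DeltaLayer;
    type Image: ImageLayer;

    /// List the layers overlapping the given key and LSN ranges.
    fn get_layers(&mut self, keys: KeyRange, lsns: LsnRange)
        -> (Vec<Self::Delta>, Vec<Self::Image>);

    /// Write a new delta layer covering the given ranges.
    fn create_delta_layer(&mut self, keys: KeyRange, lsns: LsnRange);

    /// Delete a layer that compaction has made obsolete.
    fn delete_layer(&mut self, layer: &Self::Delta);
}
```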
I did not convert the current code to the new abstraction, however. When the compaction algorithm is set to "Legacy", we just use the old code. It might be worthwhile to convert the old code to the new abstraction, so that we can compare the behavior of the new algorithm against the old one using the same simulated cases. If we do that, we have to be careful that the converted code really is equivalent to the old.
This includes only trivial changes to the main pageserver code. All the new code is behind a tenant config option, so this should be pretty safe to merge, even if the new implementation is buggy, as long as we don't enable it.
- A new compaction algorithm, implemented using the new abstraction.
The new algorithm is tiered compaction. It is inspired by the PoC at PR #4539, although I did not use that code directly, as I needed the new implementation to fit the new abstraction. The algorithm here is less advanced; I did not implement partial image layers, for example. I kept it simple on purpose, so that as we add bells and whistles, we can see their effects using the included simulator.
One difference from #4539 and your typical LSM tree implementations is how we keep track of the LSM tree levels. This PR doesn't have a permanent concept of a level, tier, or sorted run at all. There are just delta and image layers. However, when compaction starts, we look at the layers that exist and arrange them into levels, depending on their shapes. That arrangement is ephemeral: when the compaction finishes, we forget it. This allows the new algorithm to work without any extra bookkeeping, which makes it easier to transition from the old algorithm to the new one, and back again.
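As a rough illustration of the idea (this is not the PR's actual code), identifying a level can be as simple as grouping layers by the shape of their LSN ranges when compaction starts:

```rust
// Rough illustration of the ephemeral-level idea, not the PR's actual
// identify_level code: derive the topmost level purely from the layers'
// LSN ranges, and forget the grouping when compaction finishes.

#[derive(Clone, Copy, Debug, PartialEq)]
struct DeltaDesc {
    lsn_start: u64,
    lsn_end: u64, // exclusive
}

/// The topmost level consists of all layers whose LSN range ends at the
/// top of the tree. Nothing is persisted: this is recomputed on every run.
fn identify_top_level(layers: &[DeltaDesc], top_lsn: u64) -> Vec<DeltaDesc> {
    layers.iter().copied().filter(|l| l.lsn_end == top_lsn).collect()
}

fn main() {
    let layers = [
        DeltaDesc { lsn_start: 100, lsn_end: 200 }, // top level
        DeltaDesc { lsn_start: 150, lsn_end: 200 }, // top level
        DeltaDesc { lsn_start: 0, lsn_end: 100 },   // an older tier
    ];
    assert_eq!(identify_top_level(&layers, 200).len(), 2);
}
```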
There is a new tenant config option to choose the compaction algorithm. The default is "Legacy", meaning the current algorithm in 'main'. Setting it to "Tiered" enables the new algorithm.
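A minimal sketch of how that choice might be wired up, assuming an enum-valued config field (the exact names are assumptions, not necessarily the PR's config key):

```rust
// Minimal sketch of the pluggable choice; names are assumptions.

#[derive(Clone, Copy, Debug, PartialEq)]
enum CompactionAlgorithm {
    Legacy, // default: the existing code path in 'main'
    Tiered, // the new implementation behind the abstraction
}

struct TenantConf {
    compaction_algorithm: CompactionAlgorithm,
}

impl Default for TenantConf {
    // Unless explicitly overridden, a tenant keeps the legacy behavior.
    fn default() -> Self {
        TenantConf { compaction_algorithm: CompactionAlgorithm::Legacy }
    }
}

fn run_compaction(conf: &TenantConf) {
    match conf.compaction_algorithm {
        CompactionAlgorithm::Legacy => println!("running legacy compaction"),
        CompactionAlgorithm::Tiered => println!("running tiered compaction"),
    }
}

fn main() {
    run_compaction(&TenantConf::default());
}
```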
- A simulator, which implements the new abstraction.
The simulator can be used to analyze write and storage amplification, without running a test with the full pageserver. It can also draw an SVG animation of the simulation, to visualize how layers are created and deleted.
To run the simulator:
./target/debug/compaction-simulator run-suite
Here's a screen capture of what the simulator output looks like:
https://github.com/neondatabase/neon/assets/191602/5b47d977-26b3-4f06-ab79-22ddd0b17035
I really like this.
Thinking forward, pluggable compaction will play well with a couple of other places I'm thinking about adding "special" compactions in future:
- Sharding splits, where we might want to break up the keyspace a particular way when splitting up workloads
- When we hibernate idle timelines, it might make sense to compact small databases down to a single compressed object for efficiency in long term storage.
Other thoughts from reading this:
- The existence of the Timeline::compact_tiered method suggests that maybe the CompactionJobExecutor trait should expose a little bit more information, so that compact_tiered::compact_tiered can do the whole job rather than having the very early steps of compaction live in Timeline. Maybe the compaction interface needs a concept of a prepare phase and an execute phase? This might become more clear when porting the existing compaction to the new interface, if both impls end up with the same L0 depth check at the start.
- Ideally, compaction impls should have their own unit tests; maybe the simulator's recorded outputs will be handy for creating regression cases for that. We should make sure that the interface we're defining is sufficient to enable that testing (it may well already be so; it's not always obvious until we actually write the tests).
- The simulator is good for thinking about algorithmic complexity. I'm also keen to get a handle on the empirical performance aspect, so hopefully we can do a similar thing in future that runs with real layer types doing real I/O. It could make a really nice basis for imposing "this workload should do no more than N I/Os" type regression tests.
> Thinking forward, pluggable compaction will play well with a couple of other places I'm thinking about adding "special" compactions in future:
>
> - Sharding splits, where we might want to break up the keyspace a particular way when splitting up workloads
> - When we hibernate idle timelines, it might make sense to compact small databases down to a single compressed object for efficiency in long term storage.
Makes sense
> Other thoughts from reading this:
>
> - The existence of the Timeline::compact_tiered method suggests that maybe the CompactionJobExecutor trait should expose a little bit more information, so that compact_tiered::compact_tiered can do the whole job rather than having the very early steps of compaction live in Timeline.
Hmm, yeah, I see what you're saying. There isn't much in Timeline::compact_tiered, though, and I think the steps there would apply to any compaction algorithm that uses the new interface. So perhaps it needs to be renamed to Timeline::compact_with_new_interface, or simply timeline::compact.
> Maybe the compaction interface needs a concept of a prepare phase and an execute phase? This might become more clear when porting the existing compaction to the new interface, if both impls end up with the same L0 depth check at the start.
The L0 depth check is really just an optimization, to avoid calling the actual compaction when we know there's nothing to do. The compaction algorithm itself would reach the same conclusion and do nothing. It might be a premature optimization, but we nevertheless need to check the L0 layers anyway to find the top of the tree, i.e. the LSN of the newest L0. Or we could track that more explicitly.
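For illustration, the kind of early-exit I mean might look like this sketch (the names and the threshold are made up, not the PR's code):

```rust
// Hedged sketch of the early-exit described above; names and threshold
// are assumptions.

struct L0Layer {
    end_lsn: u64,
}

/// Returns the LSN of the newest L0 (the "top of the tree") if there are
/// enough L0 layers to make a compaction run worthwhile, or None to skip
/// early. Skipping is only an optimization: the algorithm itself would
/// look at the same layers and conclude there is nothing to do.
fn l0_depth_check(l0_layers: &[L0Layer], threshold: usize) -> Option<u64> {
    if l0_layers.len() < threshold {
        return None;
    }
    l0_layers.iter().map(|l| l.end_lsn).max()
}

fn main() {
    let l0 = [L0Layer { end_lsn: 100 }, L0Layer { end_lsn: 200 }];
    assert_eq!(l0_depth_check(&l0, 10), None); // too shallow, skip the run
    assert_eq!(l0_depth_check(&l0, 2), Some(200));
}
```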
> - Ideally, compaction impls should have their own unit tests; maybe the simulator's recorded outputs will be handy for creating regression cases for that. We should make sure that the interface we're defining is sufficient to enable that testing (it may well already be so; it's not always obvious until we actually write the tests).
+1. I didn't include any unit tests for the algorithm here, but yeah we should have them.
> - The simulator is good for thinking about algorithmic complexity. I'm also keen to get a handle on the empirical performance aspect, so hopefully we can do a similar thing in future that runs with real layer types doing real I/O. It could make a really nice basis for imposing "this workload should do no more than N I/Os" type regression tests.
+1. @bojanserafimov had similar thoughts at https://github.com/neondatabase/neon/pull/4411. Some thoughts on this:
- It's pretty straightforward to extract the keys, LSNs, and record lengths from existing layers (see the sketch after this list). There would be no real database data in such a dump, so we could extract it from real databases and play with it pretty freely.
- We could also extract such a dump from the original WAL in S3. The original WAL needs some processing to turn it into the key+LSN+length format, because one PostgreSQL WAL record can become multiple records in the storage, and the mapping from relfilenode, block number, etc. to the storage Key is a little complicated. But it would be possible to write such a tool.
- The compaction algorithm also needs the "KeySpace", i.e. the information about which keys exist at a given LSN. That information is stored in some special key-value pairs, and it would take some extra work to reconstruct it from the WAL or to extract it from existing layer files. But it's also doable.
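As a sketch of what one record in such a dump might look like (the type and field names are hypothetical):

```rust
// Hypothetical record type for such a dump; the names are assumptions.
// Each entry carries only the key, the LSN, and the record length: enough
// to drive the simulator, with no real database contents attached.

struct DumpRecord {
    key: u128, // the storage key (relfilenode/block already mapped)
    lsn: u64,  // LSN of the record
    len: u32,  // length of the value, for sizing layers
}

fn main() {
    // One extracted record: key, LSN, and length only.
    let rec = DumpRecord { key: 0xDEAD_BEEF, lsn: 0x16B_9188, len: 121 };
    println!("key={:x} lsn={:x} len={}", rec.key, rec.lsn, rec.len);
}
```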
> > Other thoughts from reading this:
> >
> > - The existence of the Timeline::compact_tiered method suggests that maybe the CompactionJobExecutor trait should expose a little bit more information, so that compact_tiered::compact_tiered can do the whole job rather than having the very early steps of compaction live in Timeline.
>
> Hmm, yeah, I see what you're saying. There isn't much in Timeline::compact_tiered, though, and I think the steps there would apply to any compaction algorithm that uses the new interface. So perhaps it needs to be renamed to Timeline::compact_with_new_interface, or simply timeline::compact.
I did some renaming and added a few comments to address this: https://github.com/neondatabase/neon/pull/5234/commits/e52cc061e89620b270fe2db57d2ab8fa6a2ad3a1
> > Maybe the compaction interface needs a concept of a prepare phase and an execute phase? This might become more clear when porting the existing compaction to the new interface, if both impls end up with the same L0 depth check at the start.
>
> The L0 depth check is really just an optimization, to avoid calling the actual compaction when we know there's nothing to do. The compaction algorithm itself would reach the same conclusion and do nothing. It might be a premature optimization, but we nevertheless need to check the L0 layers anyway to find the top of the tree, i.e. the LSN of the newest L0. Or we could track that more explicitly.
Added a comment about that too.
> > - Ideally, compaction impls should have their own unit tests; maybe the simulator's recorded outputs will be handy for creating regression cases for that. We should make sure that the interface we're defining is sufficient to enable that testing (it may well already be so; it's not always obvious until we actually write the tests).
>
> +1. I didn't include any unit tests for the algorithm here, but yeah we should have them.
Added some unit tests for the identify_level function with hand-crafted sets of input layers. That was pretty straightforward. We could use more unit tests for the various subroutines of compact_level.
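For example, a hand-crafted case could look roughly like this; the FakeDelta type and the identify_level signature here are stand-ins, not the crate's actual API:

```rust
// Sketch of a hand-crafted regression case in the same spirit; the types
// and the identify_level signature are assumptions, not the crate's API.

#[derive(Clone, Copy, Debug, PartialEq)]
struct FakeDelta {
    lsn_start: u64,
    lsn_end: u64,
}

// Stand-in for the real identify_level: pick the layers that form the
// level ending at `top_lsn`.
fn identify_level(layers: &[FakeDelta], top_lsn: u64) -> Vec<FakeDelta> {
    layers.iter().copied().filter(|l| l.lsn_end == top_lsn).collect()
}

#[test]
fn identify_level_excludes_partially_overlapping_layers() {
    let layers = [
        FakeDelta { lsn_start: 100, lsn_end: 200 }, // in the level
        FakeDelta { lsn_start: 50, lsn_end: 150 },  // crosses the boundary
        FakeDelta { lsn_start: 0, lsn_end: 100 },   // older tier
    ];
    let level = identify_level(&layers, 200);
    assert_eq!(level, vec![FakeDelta { lsn_start: 100, lsn_end: 200 }]);
}
```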
I'd like to move this forward. For review, I think there are two criteria:
- Is this safe to merge now?
  The new algorithm is added behind a new tenant config option and is disabled by default. It should have no effect unless you explicitly enable it. Please review the small changes to existing code to make sure that really is the case.
- Is this the general direction we want to go?
  The new tiered implementation isn't perfect by any means; there's a lot more we could and should do. There are a bunch of TODO and FIXME comments about those things. At least some of them need to be addressed, and more testing needs to be performed, before we change the default. But I'd like to get this framework committed first; those things can be addressed in follow-up PRs.
I got some positive feedback from @jcsp and @bojanserafimov already. Any objections, any competing designs?
Please review, and indicate in your review comment which angle (or both) you reviewed it from.
Rebased this, fixing conflicts. The conflicts were mostly from PR #4938, but they weren't too hard to resolve.
Closing as https://github.com/neondatabase/neon/pull/6830 has been merged.