tigerbeetle Coalesce tables instead of compacting when possible

This patch adds table coalescing to the LSM. Prior to compacting level A of a tree into level B, it instead attempts to coalesce tables in level A such that one table in level A will be completely freed. This is assumed to be more efficient than doing the compaction.

As far as I can tell this patch is fully functional and passes all tests that I am aware of, but it has plenty of deficiencies I will describe. I think though that it is time to start getting some review.

How the search works

The search is implemented in the second commit and is well documented. It is part of manifest_level.zig, with the entry point being find_table_coalesce_window.

This searches through windows of visible tables, the windows up to size lsm_growth_factor; with a limit on the maximum number of tables searched, both visible and invisible. This limit is set by lsm_manifest_coalesce_max_tables_to_search to 1024, but this number is not determined experimentally - it is intended to be "reasonably fast" to search (<1ms?), but I have not done any timing testing. The search never short circuits - it either searches all tables in the level (for small levels), or 1024 tables (for large levels).

On levels with more than 1024 tables it uses a deterministic random number to decide at what index to start the search; this is to spread the search out across table ranges, across compactions, but without the added bookkeeping of remembering the previously searched indexes. @jorangreef previously suggested to use a hash instead of a rng here, but I decided to stick with an rng because the rng has built-in facilities for picking a number in range with the correct statistical distribution, where a hash does not. I suspect the performance of a fast hash and a fast rng is similar, but have not done any testing. The RNG here is PCG, which is fast with good randomness.

Windows are chosen that can free up exactly one table - it never tries to find windows that free more than one table. Of those windows the best is chosen by the choose_coalesce_window_candidate function. This function currently just picks the window with the fewest values.

I don't really understand the I/O costs involved so am not confident that optimizing for fewest values is correct or not. We could alternately take table numbers into account, or short-circuit the search as soon as any viable window is found.

The search logic is very straightforward, but it depends on a set of iterators for which the logic is somewhat complex. This is an intentional decision for the sake of making the search readable.

The CoalasceTableIterator and CoalesceWindowIterator are decently documented, and are designed to account for the cost of visiting invisible tables when limiting the amount of work done per search, and avoid revisiting invisible tables many times in pathological cases. The complexity of these iterators is unfortunate for both readability and performance, and I was reluctant to do it this way, but ultimately decided it was the cleanest way to correctly account for invisible tables. Even today I was tweaking the logic in these iterators so they deserve a close look.

Note that because iterating invisible tables counts toward the limits of the search, levels with many invisible tables will have their coalesce-window search significantly weakened. As I understand it snapshots are a theoretical feature right now, so this is just a consideration for the future.

How integration with compaction works

It's plugged into CompactionPipeline.beat.

Compaction of levels is striped across half-bars. Prior to attempting to compact level b (a into b) with bar_setup, this function first "backs up" and checks whether it can instead coalesce level a.

When this happens it sets level "a"s Compaction object for coalescing (and from the view of Compaction and the rest of the forest code this level "a" will be seen as a level "b", with no level "a" compacting into it).

When coalescing happens, an "odd" level gets compacted in the same half-bar as other "even" levels (or vice-versa). e.g. if our half-bar is compacting levels (0, 2, 4), we might actually end up compacting (0, 1, 4) if it was determined that coalescing level 1 was better than compacting level 1 into level 2.

This mixing of odd and even compaction levels within a single half bar is the reason for the first commit of this patch that removes most of the checks for which level is allowed to be compacting during which half-bar.

This adds a new coalesce variant to TableInfoA that indicates that compaction is doing a coalesce and there is no table A.

Limitations

I have not yet been able to test on large levels - I have only tested up to level 3. This means I haven't ever run the code path where the search randomly selects the table index to begin searching from. I don't even know how to generate a workload that builds deep enough levels to test this - I am running out of memory on large forest_fuzz workloads. I could test on a large-memory cloud VM if that's the right way to do it.

There are no new tests here. I think there should be some unit tests for the coalesce search but haven't tried to write them yet. I think there should also be an integration test that coalescing limits write amplification on small-batch workloads, but don't know how to write it. Also a smoke test that just confirms coalescing is on and doing something.

I don't know how long it takes to search 1024 tables. I think we should have some idea of the CPU time of this operation.

I don't know how often this optimization is triggered on real workloads, but suspect that there are workloads where it is ineffective, and that it may be more or less effective on different trees. It may be useful to completely avoid this search in some situations, and I have indicated this with a todo, but don't have enough evidence to know when or how to do it yet.

(The benchmark workload is ineffective at triggering this optimization, but the forest_fuzz triggers it at all levels).

Coalesced compaction objects never set drop_tombstones to true. I am not sure the impact of this. I note that coalescing always follows the "copy" path in the compactor, never merge, move, or, "copy_drop_tombstones".

Coalesce is never performed for the last level. This is mostly just because the structure of the compaction setup code doesn't lend itself easily to coalescing the last level.

This adds a new constant, value_count_max_actual, which contains the maximum number of values allowed in a particular table. Contrast this with the existing value_count_max constant, which is sometimes a lower value. I don't really understand the logic around these two constants but the dual names are quite confusing and I'm open to suggestions about how to rename one of them.

Apr 21 '24 00:04 brson

I pushed an additional commit that asserts that coalescing resulted in a free table, and I'm interested in other assertions I should be making.

Apr 21 '24 02:04 brson

I'm still looking through this properly, but comparing with the benchmark client set to send in batches of 10 for 100k transactions, shows a massive space amp win from 8709MiB to 2332MiB!

Apr 22 '24 13:04 cb22

The value for lsm_manifest_coalesce_max_tables_to_search should originate at ConfigCluster since (iiuc) its choice ultimately impacts data layout on disk.

I have not yet been able to test on large levels - I have only tested up to level 3. This means I haven't ever run the code path where the search randomly selects the table index to begin searching from.

...Plus then you can pick a smaller value for the test_min config, which means the fuzzer will be able to hit interesting cases in a reasonable about of time/RAM.

(The benchmark workload is ineffective at triggering this optimization, but the forest_fuzz triggers it at all levels).

Is this because the benchmark creates full batches, which never need to be coalesced?

Apr 22 '24 20:04 sentientwaffle

(The benchmark workload is ineffective at triggering this optimization, but the forest_fuzz triggers it at all levels).

Is this because the benchmark creates full batches, which never need to be coalesced?

Yes.

Apr 23 '24 23:04 brson

I've resolved @matklad's reviews. Thanks! Have not resolved others yet.

Apr 25 '24 00:04 brson

We're going to try another approach to solving this problem.

May 08 '24 14:05 brson