F2 - Infinite compaction bug
When checking out a fresh copy from main, the following F2 tests fail (all related to RMW):
| Test | Result | Duration | Traits | Error Message |
| --- | --- | --- | --- | --- |
| HotColdTests/HotColdParameterizedTestParam.Rmw/0 [(2048, false, false)] | Failed | < 1 ms | GoogleTestAdapterTestType [Parameterized] | !! This test has probably CRASHED !! (test output below) |
| HotColdTests/HotColdParameterizedTestParam.Rmw/1 [(2048, false, true)] | Failed (stale) | 582 ms | GoogleTestAdapterTestType [Parameterized] | Expected equality of these values: Status::Ok Which is: \0 result Which is: |
| HotColdTests/HotColdParameterizedTestParam.Rmw/2 [(2048, true, false)] | Failed (stale) | 575 ms | GoogleTestAdapterTestType [Parameterized] | Expected equality of these values: Status::Ok Which is: \0 result Which is: |
| HotColdTests/HotColdParameterizedTestParam.Rmw/3 [(2048, true, true)] | Failed (stale) | 584 ms | GoogleTestAdapterTestType [Parameterized] | Expected equality of these values: Status::Ok Which is: \0 result Which is: |
| HotColdTests/HotColdParameterizedTestParam.Rmw/4 [(4194304, false, false)] | Failed (stale) | 653 ms | GoogleTestAdapterTestType [Parameterized] | Expected equality of these values: Status::Ok Which is: \0 result Which is: |
| HotColdTests/HotColdParameterizedTestParam.Rmw/5 [(4194304, false, true)] | Failed (stale) | 658 ms | GoogleTestAdapterTestType [Parameterized] | Expected equality of these values: Status::Ok Which is: \0 result Which is: |
| HotColdTests/HotColdParameterizedTestParam.Rmw/6 [(4194304, true, false)] | Failed (stale) | 644 ms | GoogleTestAdapterTestType [Parameterized] | Expected equality of these values: Status::Ok Which is: \0 result Which is: |
| HotColdTests/HotColdParameterizedTestParam.Rmw/7 [(4194304, true, true)] | Failed (stale) | 666 ms | GoogleTestAdapterTestType [Parameterized] | Expected equality of these values: Status::Ok Which is: \0 result Which is: |

Test output from the crashed /0 run:

```
[0000.49112300]{ 0}::INFO::f2_test.cc:620: TEST: Index size: 2048 Auto compaction: 0 Use Read Cache: 0
[0000.49492600]{ 0}::DEBUG::persistent_memory_malloc.h:269: Log size: 192.000 MiB
[0000.49524100]{ 0}::DEBUG::persistent_memory_malloc.h:270: Mutable fraction: 0.400
[0000.49530700]{ 0}::DEBUG::persistent_memory_malloc.h:271: Pre-allocate: FALSE
[0000.49535100]{ 0}::DEBUG::persistent_memory_malloc.h:592: Num mutable_pages = 2
[0000.57404200]{ 0}::DEBUG::faster.h:160: Hash Index Size: 2048
[0000.191032100]{ 0}::DEBUG::faster.h:166: Read cache is DISABLED
[0000.191085300]{ 0}::DEBUG::faster.h:171: Auto compaction is DISABLED
[0000.191628700]{ 0}::DEBUG::persistent_memory_malloc.h:269: Log size: 192.000 MiB
[0000.191655000]{ 0}::DEBUG::persistent_memory_malloc.h:270: Mutable fraction: 0.000
[0000.191666100]{ 0}::DEBUG::persistent_memory_malloc.h:271: Pre-allocate: FALSE
[0000.191672100]{ 0}::DEBUG::persistent_memory_malloc.h:592: Num mutable_pages = 2
[0000.198695200]{ 0}::DEBUG::cold_index.h:79: ColdIndex will be stored @ test_f2_store\cold_\cold-index\
[0000.199561100]{ 0}::DEBUG::faster.h:160: Hash Index Size: 2048
[0000.199574300]{ 0}::DEBUG::cold_index.h:129: ColdIndex: 2048 HT entries -> 256 MiB in-memory [~153 MiB mutable]
[0000.199899800]{ 0}::DEBUG::persistent_memory_malloc.h:269: Log size: 256.000 MiB
[0000.199917800]{ 0}::DEBUG::persistent_memory_malloc.h:270: Mutable fraction: 0.600
[0000.199924900]{ 0}::DEBUG::persistent_memory_malloc.h:271: Pre-allocate: FALSE
[0000.199930000]{ 0}::DEBUG::persistent_memory_malloc.h:592: Num mutable_pages = 4
[0000.210031700]{ 0}::DEBUG::faster.h:160: Hash Index Size: 2048
[0000.348919600]{ 0}::DEBUG::faster.h:166: Read cache is DISABLED
[0000.348975000]{ 0}::DEBUG::faster.h:171: Auto compaction is DISABLED
[0000.348985700]{ 0}::DEBUG::faster.h:166: Read cache is DISABLED
[0000.348990000]{ 0}::DEBUG::faster.h:171: Auto compaction is DISABLED
[0000.349001900]{ 0}::INFO::f2.h:224: Automatic F2 compaction is disabled
[0028.2772776324]{ 0}::DEBUG::f2.h:750: Compact HOT: {0.07 GB} {Goal 0.07 GB} [64 79606064] -> [64 79606064]
```
| Test | Result | Duration | Traits | Error Message |
| --- | --- | --- | --- | --- |
| CompactLookupOnDiskTests/CompactLookupParameterizedOnDiskTestFixture.OnDiskRmw/0 [(true, false, 1)] | Failed (stale) | 27.8 sec | GoogleTestAdapterTestType [Parameterized] | Expected equality of these values: Status::Ok Which is: \0 result Which is: |
| CompactLookupOnDiskTests/CompactLookupParameterizedOnDiskTestFixture.OnDiskRmw/1 [(true, false, 8)] | Failed (stale) | 189 ms | GoogleTestAdapterTestType [Parameterized] | Expected equality of these values: Status::Ok Which is: \0 result Which is: |
| CompactLookupOnDiskTests/CompactLookupParameterizedOnDiskTestFixture.OnDiskRmw/2 [(true, true, 8)] | Failed (stale) | 193 ms | GoogleTestAdapterTestType [Parameterized] | Expected equality of these values: Status::Ok Which is: \0 result Which is: |
I modified one of the tests at the line below to try to get it passing. The purpose of the test appears to be proving that N bytes of records stay in memory until further RMW ops cycle them out. It was failing because the ops were immediately going `Status::Pending` instead of returning `Status::Ok`:
https://github.com/microsoft/FASTER/blob/7f71289fc1e5a188e90a50f26994fb0d88657dc0/cc/test/f2_test.cc#L640
I changed the value to 10,000.
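Roughly what I understand the test to be checking, as a self-contained toy model (not the actual FASTER test; the store, budget, and record size here are all made up):

```cpp
#include <cassert>
#include <cstddef>

enum class Status { Ok, Pending };

// Toy stand-in for a store with a fixed in-memory (mutable) budget:
// once the budget is exceeded, ops go asynchronous and return Pending.
struct TinyStore {
  std::size_t bytes_in_memory = 0;
  std::size_t mutable_budget;
  explicit TinyStore(std::size_t budget) : mutable_budget(budget) {}
  Status Rmw(std::size_t record_size) {
    bytes_in_memory += record_size;
    return bytes_in_memory <= mutable_budget ? Status::Ok : Status::Pending;
  }
};

int main() {
  constexpr std::size_t kRecordSize = 32;     // assumed record footprint
  constexpr std::size_t kNumRecords = 10000;  // the lowered record count
  TinyStore store{ 64 * 1024 * 1024 };        // e.g. a 64 MiB mutable region
  for (std::size_t i = 0; i < kNumRecords; ++i) {
    // The test's expectation: while records fit in the mutable region,
    // RMW completes in memory with Ok instead of going Pending.
    assert(store.Rmw(kRecordSize) == Status::Ok);
  }
}
```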
Re-running the test with the lower value causes an infinite compaction loop that I have also observed in my own benchmarking of F2. I don't know the deep inner mechanics yet, but these are my observations:
- In `faster.h`, `pages_available` is always true.
- In `LightEpoch`, the calculation that determines the next safe epoch to flush becomes disconnected from the pending operations. When this happens, there is always a `PENDING_GC` (and sometimes a `REST`) status operation with an epoch greater than the safe minimum epoch that can be reclaimed.
To expand on the second point: the safe-epoch calculation becomes stuck and never advances, and the final pending items, which appear to have their final completion status set to pending garbage collection, remain permanently out of reach of `LightEpoch` and can never complete. I suspect this is why the `pages_available` loop happens.
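To illustrate the mechanism I mean, here is a minimal, self-contained model of epoch-based reclamation (purely illustrative, not FASTER's actual `LightEpoch` code): the safe epoch is the minimum over all protected thread-local epochs, so a single entry stuck at an old epoch pins it forever, and any drain action queued behind it can never run.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

constexpr uint64_t kUnprotected = UINT64_MAX;

// Safe epoch = min over all currently protected thread-local epochs.
// A drain action tagged with epoch e may only run once safe_epoch >= e.
uint64_t SafeEpoch(const std::vector<uint64_t>& thread_epochs, uint64_t current) {
  uint64_t safe = current;
  for (uint64_t e : thread_epochs) {
    if (e != kUnprotected) safe = std::min(safe, e);
  }
  return safe;
}

int main() {
  uint64_t current_epoch = 100;
  // One entry never refreshed: it is still protected at epoch 5.
  std::vector<uint64_t> thread_epochs = { 5, 99, kUnprotected };
  // No matter how far current_epoch advances, the safe epoch stays pinned
  // at 5, so an action queued at any epoch > 5 (e.g. a PENDING_GC
  // completion) remains permanently out of reach.
  std::cout << "safe epoch: " << SafeEpoch(thread_epochs, current_epoch) << "\n";
}
```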
I wish I could submit a PR, but these are deep / central mechanics and I'm not sure how long it will be before I can solve this myself.
TL;DR - compaction can be triggered in a state that doesn't change the current page, and the current compaction logic explicitly requires it to change, so the retry loop runs forever.
Did some debugging. It seems that compaction can be triggered in a state where the head and tail of the hot log are not advanced. The compaction process spin-loops, explicitly waiting for them to change, and file creation is only triggered when they do change. So even if you modify the spin loop to break free, the code proceeds to trigger the file IO for the page, but the files have not yet been created.
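Here is a toy model of that failure mode (illustrative only; the real loop in `log_scan.h` is more involved, and `tail` here just stands in for the hot log's tail address):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

std::atomic<uint64_t> tail{ 1024 };  // stand-in for the hot log's tail address

// Broken shape: progress requires someone else to advance the tail. With an
// empty compaction range nothing is appended, so the real (unbounded) loop
// spins forever; the bound here exists only so the model terminates.
bool TryWaitForTail(uint64_t until, int max_spins) {
  for (int i = 0; i < max_spins; ++i) {
    if (tail.load(std::memory_order_acquire) >= until) return true;
  }
  return false;
}

// Fixed shape: clamp the observed tail to the scan's end address, so the
// rest of the iteration can run and the (empty) range finishes normally.
uint64_t ObserveTailClamped(uint64_t until) {
  uint64_t t = tail.load(std::memory_order_acquire);
  if (t >= until) {
    t = until;
  }
  return t;
}

int main() {
  const uint64_t until = tail.load();          // empty range: until == tail
  assert(!TryWaitForTail(until + 1, 1000));    // waiting past the tail: hangs
  assert(ObserveTailClamped(until) == until);  // clamping lets the scan finish
}
```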
The solution appears to be:

- Stop breaking out of the loop here: https://github.com/microsoft/FASTER/blob/7f71289fc1e5a188e90a50f26994fb0d88657dc0/cc/src/core/log_scan.h#L652
- Instead, do this:

```cpp
if (tail_address >= until_) {
  tail_address = until_;
}
```

This allows the rest of the loop to proceed, where the existing logic ensures that the page is in the correct state.
Then, where you have `assert(files)`:
https://github.com/microsoft/FASTER/blob/7f71289fc1e5a188e90a50f26994fb0d88657dc0/cc/src/device/file_system_disk.h#L428
You need to create the file bundle if it does not exist. This is a quirk of the fix in the context of the broader file-creation logic, which is currently only triggered when the page changes. In any case, you're already holding the xlock on the file here, so you can create-if-not-exists just like the other code paths do:
```cpp
bundle_t* files = files_.load();
if (!files) {
  // First segment opened: create the bundle here, since the normal
  // creation path only runs when the page changes.
  void* buffer = std::malloc(bundle_t::size(1));
  bundle_t* new_files = new(buffer) bundle_t{ filename_, file_options_, handler_,
                                              begin_segment_, begin_segment_ + 1 };
  files_.store(new_files);
  files = files_.load();
  newFiles = true;  // flag from the surrounding function
}
assert(files);
if (files->begin_segment >= new_begin_segment) {
  // Segments have already been truncated.
  gc_state.IssueTruncateCallback(files->begin_segment * kSegmentSize);
  return;
}
```
I then also modified the failing tests so the record count doesn't immediately cause a spill to disk, as they were originally intended, and now they pass. I also duplicated them into versions that do immediately spill to disk, and those pass too. All tests pass.
I'm honestly not sure why those tests cause a spill when they shouldn't. I double-checked the math (record size, index size, etc.) and on the surface the record count used should not spill, but it does. Maybe it's a result of other, deeper math (like the fact that the mutable region in these tests is only 40%). Not sure.
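For reference, here's the surface-level math I mean (the log size and mutable fraction come from the debug output above; the record footprint is an assumption, since the real size depends on the test's key/value types):

```cpp
#include <cstdio>

int main() {
  // From the debug output above: hot log = 192 MiB, mutable fraction = 0.4.
  constexpr double mutable_mib = 192.0 * 0.4;  // ~76.8 MiB mutable

  // Assumed record footprint (header + key + value); the true size depends
  // on the test's key/value types, so this is only an estimate.
  constexpr double record_bytes = 32.0;
  constexpr double num_records = 20000.0;
  constexpr double footprint_mib = num_records * record_bytes / (1024.0 * 1024.0);

  // ~0.6 MiB of records vs ~76.8 MiB of mutable space: on paper, nothing
  // should spill to disk, yet the ops immediately go Pending.
  std::printf("mutable: %.1f MiB, records: %.2f MiB\n", mutable_mib, footprint_mib);
}
```

Even if the assumed record size is off by an order of magnitude, the footprint stays far below the mutable budget, which is why the immediate spill is surprising.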
Anyway, will submit a PR, though I'm sure you'll want to tweak my fix.
This doesn't solve the problem in all scenarios. I still encounter a deadlock where the epoch drain list gets disconnected and never catches up. There's some sort of subtle concurrency bug in the atomics, and my patch obviously only addresses one of its symptoms.
It does have to do with re-entrancy, and it appears that the reentrant functions in `LightEpoch` are unused, if that helps at all. Unfortunately, I'd have to study the entire system to fix this.
cc @kkanellis
@TechnikEmpire thank you for opening this issue and for your initial investigation and comments -- this is very helpful!
Let me take a deeper look at this and will report back.
No problem, and sorry I couldn't be of more help. FWIW, after applying my patch I was still able to repro the drain disconnect just by doing concurrent store writes in Google Benchmark. I had built a var-length store atop F2, and my write benchmark would hit this.
Off topic, but if you profile F2 in VS2022 you'll find a couple of if statements in the internals that are really hot. They seem to cause a big problem for branch prediction; if you replace them with constexpr jump tables, general perf goes up. I'll submit a patch for that once this is worked out. Good luck, and thanks for your OSS work.
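For clarity, this is the shape of the transformation I mean (illustrative only, not FASTER's actual code; the enum and costs are made up):

```cpp
#include <array>
#include <cstdint>

enum class Op : uint8_t { Read = 0, Upsert = 1, Rmw = 2, Delete = 3 };

// Branchy version: a chain of ifs on a hot path can mispredict badly
// when the operation mix is irregular.
int CostBranchy(Op op) {
  if (op == Op::Read) return 1;
  if (op == Op::Upsert) return 2;
  if (op == Op::Rmw) return 3;
  return 4;
}

// Table version: the branches become a single indexed load from a
// compile-time table.
constexpr std::array<int, 4> kCost = { 1, 2, 3, 4 };

int CostTable(Op op) {
  return kCost[static_cast<uint8_t>(op)];
}

int main() {
  return CostBranchy(Op::Rmw) == CostTable(Op::Rmw) ? 0 : 1;
}
```

The same idea works with a constexpr table of function pointers when the branches select behavior rather than a value.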
@TechnikEmpire I was able to reproduce this bug locally by setting the number of records in the Rmw F2 test to 10,000, as you did. The root cause appears to be invoking compaction with an empty range (i.e., `until_address == hlog.begin_address`), which normally should not happen.
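Conceptually, the fix guards against entering compaction with an empty range, something like the following (an illustrative sketch only, not the exact diff; the names just mirror the description above):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch only: skip compaction entirely when the requested
// range is empty, rather than entering the scan loop and spin-waiting on
// a tail address that will never advance.
bool ShouldCompact(uint64_t until_address, uint64_t begin_address) {
  return until_address > begin_address;
}

int main() {
  assert(!ShouldCompact(/*until=*/4096, /*begin=*/4096));  // empty range: skip
  assert(ShouldCompact(/*until=*/8192, /*begin=*/4096));   // normal case
}
```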
I’ve opened a draft PR with a small fix that resolves the issue in my environment. If you could try it out and confirm whether it works for you as well, that would be very helpful.
Separately, I’m also curious about the problem you encountered when running the Rmw F2 test with 20,000 records. I haven’t been able to reproduce that behavior — could you share more details about your environment?
@kkanellis So sorry I didn't reply, I'm not sure how I missed this. I ended up developing a custom system for my use-case and replacing F2, but I'll pull your fixes now and give it a shot. Thanks for looking into this 🙏