Shorten critical section in findEviction

igchor opened this pull request 3 years ago • 24 comments

Remove the item from mmContainer and drop the lock before attempting eviction.

The change improves throughput for the default hit_ratio/graph_cache_leader_fbobj config by ~30%. It also reduces p99 latencies significantly. The improvement is even bigger for the multi-tier approach (multiple memory tiers) we are working on here: https://github.com/pmem/CacheLib

I was not able to find any races/synchronization problems with this approach but there is a good chance I missed something - it would be great if you could review and evaluate this patch.
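
Roughly, the change looks like this (a sketch with hypothetical helper names such as pickEvictionCandidate() and tryEvict(), not the actual CacheLib code): the MMContainer lock is held only long enough to pick and unlink a candidate, and the eviction work itself runs outside of it.

```cpp
// Sketch of the idea only - the locking and eviction helpers below are
// illustrative names, not actual CacheLib APIs.
Item* findEvictionSketch(MMContainer& mmContainer) {
  Item* candidate = nullptr;
  {
    // Critical section: only select and unlink the LRU candidate.
    auto l = mmContainer.acquireLock();
    candidate = mmContainer.pickEvictionCandidate();  // e.g. the LRU tail
    if (!candidate) {
      return nullptr;
    }
    mmContainer.removeLocked(*candidate);
  }  // lock dropped here

  // The expensive work (access-container removal, NVM admission, recycling)
  // now runs without holding the MMContainer lock.
  if (!tryEvict(*candidate)) {
    // One option on failure is to put the candidate back, at the cost of
    // making it "hot" again.
    mmContainer.add(*candidate);
    return nullptr;
  }
  return candidate;
}
```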

igchor avatar Apr 15 '22 09:04 igchor

The implementation in the first commit suffers from a few problems:

  • the item is added back to the MMContainer if eviction fails (becomes "hot")
  • the item can potentially be destroyed after dropping MMContainer lock (?)
  • there might be some races with SlabRelease

One idea we have for solving these issues is presented in the second patch.

It increments the ref count of the item before dropping the lock, which synchronizes with other threads doing eviction and prevents the item from being destroyed.
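
In sketch form (again with hypothetical helpers; incRef() stands in for CacheLib's refcount machinery), the second patch does roughly:

```cpp
// Sketch only: pin the candidate while the MMContainer lock is still held,
// so no other thread can free it once the lock is dropped.
Item* candidate = nullptr;
{
  auto l = mmContainer.acquireLock();
  candidate = mmContainer.pickEvictionCandidate();
  if (!candidate) {
    return nullptr;
  }
  candidate->incRef();                 // taken under the MMContainer lock
  mmContainer.removeLocked(*candidate);
}  // safe to drop the lock: a non-zero refcount keeps the item alive

// Other evicting threads skip items with a non-zero refcount, and the item
// cannot be destroyed until this reference is released.
```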

igchor avatar Apr 28 '22 16:04 igchor

@igchor: Hi, I will be reviewing this week. In the meanwhile, could you run the following cachebench tests? These test the consistency of item values that involve chained items.

My concern is around prematurely dropping the eviction lock for a regular item when a concurrent chained item eviction is happening, and vice-versa. An additional concern is around the interaction between Slab-release and eviction code path. I will focus on these during my review.

`cachelib/cachebench/test_configs/consistency/chained-items-stress-moving.json`
`cachelib/cachebench/test_configs/consistency/chained-items-stress-no-moving.json`
`cachelib/cachebench/test_configs/consistency/ram-with-deletes.json`

therealgymmy avatar May 23 '22 17:05 therealgymmy

@therealgymmy So, the only needed fix for correctness is to change the memory order on reads (at least for non-chained items)? Why is the relaxed memory order enough right now?

Also, here is my rationale for why the 4th and 5th cases should work (perhaps after fixing the memory order):

In my patch, if the item is removed from the LRU, then it must have also been removed from AccessContainer. This means that the item cannot be moved due to this check: https://github.com/facebook/CacheLib/blob/57c8bd34003fc71bbacef1fcce4e1d12fd5d09e6/cachelib/allocator/CacheAllocator-inl.h#L1075 . It can only be evicted.

If the item was already freed in findEviction (before any operation in SlabRelease), markMoving() will not succeed (we see that the item is freed), so we can just finish.

There could be a problem when SlabRelease calls markMoving() and then evictForSlabRelease just before this point: https://github.com/facebook/CacheLib/blob/57c8bd34003fc71bbacef1fcce4e1d12fd5d09e6/cachelib/allocator/CacheAllocator-inl.h#L1267. But at that time, the element has already been removed from the LRU, causing markMoving() to fail.

Also, if markMoving() is called after the element is removed from the AccessContainer and before it is removed from the LRU (inside tryEvictRegularItem), then the thread doing findEviction will exit before freeing the item (because the moving bit is set).

Are there any issues with my reasoning?

igchor avatar May 25 '22 21:05 igchor

the only needed fix for correctness is to change the memory order on reads (at least for non-chained items)? Why is the relaxed memory order enough right now?

@igchor: please ignore my comments on the mem ordering. I overlooked that after we remove from mm-container, we unmark an item's kLinked bit with acq_rel ordering, which will force the "eviction thread" and "slab release thread" to be correctly ordered.

My concern previously was this:

  1. slab release thread marks an item as moving as it sees item's kLinked bit is still marked
  2. eviction thread sees the item moving bit UNmarked (reading a stale value from (1))

I thought it could happen because L2397 (markMovingForSlabRelease) marks the bit with acq_rel while L1370 (`if (evictHandle->isMoving())`) reads it with relaxed ordering. However, the code at L1360, `mmContainer.remove(itr)`, calls Refcount::unmarkInMMContainer(), which unmarks the kLinked bit with acq_rel. This means L1360 and L2397 must be ordered correctly, and thus L1370, which is sequenced after L1360, must be ordered correctly with L2397 as well.
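
As a toy illustration of that argument (made-up bit layout and helper names, not the real RefcountWithFlags code): both threads do read-modify-writes on the same flags word, and clearing kLinked with acq_rel orders the eviction thread against a markMoving() that only succeeds while kLinked is still set.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint32_t kLinked = 1u << 0;
constexpr uint32_t kMoving = 1u << 1;
std::atomic<uint32_t> flags{kLinked};

// Slab-release thread: marking as moving only succeeds while kLinked is set.
bool markMovingForSlabRelease() {
  uint32_t cur = flags.load(std::memory_order_acquire);
  while (cur & kLinked) {
    if (flags.compare_exchange_weak(cur, cur | kMoving,
                                    std::memory_order_acq_rel,
                                    std::memory_order_acquire)) {
      return true;
    }
  }
  return false;
}

// Eviction thread.
bool evictionThreadSeesMoving() {
  // mmContainer.remove() -> unmarkInMMContainer(): an acq_rel RMW on the
  // same word orders this thread against markMovingForSlabRelease().
  flags.fetch_and(~kLinked, std::memory_order_acq_rel);
  // The subsequent relaxed read (isMoving()) therefore cannot observe a
  // value older than a markMoving() that saw kLinked still set.
  return flags.load(std::memory_order_relaxed) & kMoving;
}
```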

therealgymmy avatar Jun 01 '22 00:06 therealgymmy

@therealgymmy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Jun 01 '22 00:06 facebook-github-bot

@igchor: can you take care of the comments and rebase this PR to latest version? I will try importing it again and run through the internal tests.

therealgymmy avatar Jun 01 '22 16:06 therealgymmy

@igchor has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jun 02 '22 12:06 facebook-github-bot

I've responded to the comments, run the tests you suggested, and reworked the code. I realized there was one other problem: incrementing the ref count under the mmContainer lock was causing races with replaceIf/removeIf. I changed the code to increment the ref count under the AC lock so that it properly synchronizes with the predicate evaluation inside replaceIf/removeIf.

I believe that incrementing the refCount without taking the lock is possible but would require changes in other places in the code. The performance impact of taking the AC lock is pretty small (~5% in terms of throughput) for the hit_ratio benchmarks. Performance with this patch is still better than that of the main branch.

igchor avatar Jun 02 '22 13:06 igchor

@therealgymmy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Jun 06 '22 22:06 facebook-github-bot

@igchor: the following test failed:

cachelib/allocator/tests:cache-allocator-test - BaseAllocatorTest/3.AddChainedItemMultithread

It failed on this line: https://github.com/facebook/CacheLib/blob/main/cachelib/allocator/tests/BaseAllocatorTest.h#L4522

With the following output:

buck-out/dev/gen/aab7ed39/cachelib/allocator/tests/cache-allocator-test#platform010-clang,private-headers/cachelib/allocator/tests/BaseAllocatorTest.h:4522
Expected: (nullptr) != (childItem), actual: (nullptr) vs nullptr
buck-out/dev/gen/aab7ed39/cachelib/allocator/tests/cache-allocator-test#platform010-clang,private-headers/cachelib/allocator/tests/BaseAllocatorTest.h:4522
Expected: (nullptr) != (childItem), actual: (nullptr) vs nullptr
buck-out/dev/gen/aab7ed39/cachelib/allocator/tests/cache-allocator-test#platform010-clang,private-headers/cachelib/allocator/tests/BaseAllocatorTest.h:4522
Expected: (nullptr) != (childItem), actual: (nullptr) vs nullptr

This wasn't a crash, but rather that we couldn't allocate a new chained item even after 100 attempts. This shouldn't be possible since we have 3 allocating threads and each thread can hold at most 11 outstanding item handles (1 parent + 10 chained items). By default we walk the bottom 50 items of the eviction queue and evict an item without any refcount, so each allocating thread should always succeed in allocating an item.


This also triggered failures in a number of downstream tests from internal services that depend on cachelib. I skimmed over those and they all seem related to allocating chained items.


I changed the code to increment the ref count under AC lock

Did you mean under the hash-table lock? Seems like the change is to get an item handle by calling find()?

auto toReleaseHandle = accessContainer_->find(*candidate);

therealgymmy avatar Jun 07 '22 00:06 therealgymmy

@therealgymmy Thank you for the feedback!

I managed to reproduce the issue locally and will work on fixing it.

Did you mean under the hash-table lock? Seems like the change is to get an item handle by calling find()?

Yes, under the hash-table (Access Container) lock. I actually realized that instead of increasing the refCount (which requires this hash-table lock), we could just mark an item as moving in findEviction (if not already marked). This would prevent any other thread from releasing this item, and we could even reuse the existing evictNormalItemForSlabRelease method (for ChainedItems we would still need a separate code path).

I have just one concern/question regarding this approach. In the current implementation, is it possible for multiple threads to mark the same item as moving? (I don't see any check for that in markMoving().) If not, how is this prevented? Is there an assumption that only one thread can do SlabRelease at any point?

igchor avatar Jun 09 '22 08:06 igchor

Hi Igor,

I have just one concern/question regarding this approach. In current implementation, is it possible for multiple threads to mark the same item as moving?

Only one thread can mark the same item as moving. Multiple threads can be releasing slabs, but the same slab can only be released by one thread.

Thanks, Jimmy

therealgymmy avatar Jun 10 '22 23:06 therealgymmy

@igchor has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jun 13 '22 17:06 facebook-github-bot

@igchor has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jun 13 '22 17:06 facebook-github-bot

@igchor has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jun 13 '22 17:06 facebook-github-bot

@therealgymmy I found out what the issue was: I was passing candidate to releaseBackToAllocator, which after my changes represented the parent Item. We should instead pass a pointer to the child we actually want to recycle.

Also, instead of increasing the refCount of the item, I now rely on the moving flag. I changed the markMoving function to only succeed if the item is not yet marked as moving. This guarantees proper synchronization with other evicting threads and with the SlabRelease thread. Once the item is marked as moving inside findEviction, we can just execute the function which is used in SlabReleaseImpl for regular items - this removes some of the code duplication.

If the eviction fails (due to the refCount being != 0), we unmark the item and check the flags. If the flags are 0, the item is not in use anywhere (and was not freed by any other thread since it was marked moving) and has already been removed from the AC and MMContainer by another thread, so we can recycle it anyway.
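
In sketch form, the new flow looks roughly like this (markMoving() mirrors the description above; evictNormalItem() and the unmark/flags accessor are illustrative names only, and the unmark is assumed to return the remaining flags atomically):

```cpp
// Rough sketch of the reworked eviction attempt, not the actual patch.
Item* tryEvictCandidate(Item& candidate) {
  // markMoving() now succeeds only if the item is not already marked, which
  // serializes this thread against SlabRelease and other evicting threads.
  if (!candidate.markMoving()) {
    return nullptr;  // someone else owns this item's eviction or move
  }

  if (evictNormalItem(candidate)) {  // same routine SlabReleaseImpl uses
    return &candidate;
  }

  // Eviction failed because the refCount was != 0: unmark the moving bit and
  // look at the remaining flags. All-zero flags mean no one uses the item and
  // another thread has already unlinked it from the AC and MMContainer, so it
  // is safe to recycle it anyway.
  if (candidate.unmarkMoving() == 0) {
    return &candidate;
  }
  return nullptr;
}
```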

After I implemented the above change, I also realized it would be quite easy to use combined locking for the eviction iterator, which I also implemented (I took the idea from https://github.com/facebook/CacheLib/blob/main/cachelib/allocator/MM2Q-inl.h#L256). The performance results are pretty good now; we can share the exact numbers with you once we finish running all the benchmarks.
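
For reference, combined locking here means the container runs the caller's walk under its own mutex (folly::DistributedMutex::lock_combine, as in the MM2Q code linked above) instead of handing out a locked iterator. A minimal sketch, with hypothetical container internals (Item, LruList, Iterator, removeLocked):

```cpp
#include <folly/synchronization/DistributedMutex.h>

// Tiny stand-in for the MMContainer, showing only the shape of the API.
struct MMContainerSketch {
  folly::DistributedMutex lruMutex_;
  LruList lru_;  // hypothetical intrusive LRU list of Items

  template <typename Func>
  void withEvictionIterator(Func&& fn) {
    lruMutex_.lock_combine([this, &fn] {
      fn(Iterator{lru_.rbegin()});  // run the caller's walk under the lock
    });
  }
};

// Caller side: select and unlink the candidate inside the combined section,
// then do the actual eviction work after it returns.
Item* candidate = nullptr;
mmContainer.withEvictionIterator([&](auto itr) {
  for (; itr.valid() && itr->getRefCount() != 0; ++itr) {
    // skip items that are still referenced
  }
  if (itr.valid()) {
    candidate = &*itr;
    mmContainer.removeLocked(*candidate);  // lock already held by lock_combine
  }
});
```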

igchor avatar Jun 13 '22 17:06 igchor

@therealgymmy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Jun 16 '22 18:06 facebook-github-bot

@igchor has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jun 20 '22 15:06 facebook-github-bot

@igchor has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jun 20 '22 15:06 facebook-github-bot

@therealgymmy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Jun 20 '22 16:06 facebook-github-bot

@igchor has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jun 24 '22 16:06 facebook-github-bot

@igchor has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jun 24 '22 17:06 facebook-github-bot

@igchor: cleaning up the slab release logic can be done in a separate PR.

I will need to run this change on a shadow setup for some of our production services. Once I verify the shadows run correctly, we'll be sufficiently confident to approve this PR and merge it in.

therealgymmy avatar Jun 30 '22 17:06 therealgymmy

@therealgymmy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Jun 30 '22 18:06 facebook-github-bot

@therealgymmy I have opened a new PR: https://github.com/facebook/CacheLib/pull/166 which is basically a subset of this one. It would be great if you could take a look.

igchor avatar Oct 11 '22 21:10 igchor