kvrocks icon indicating copy to clipboard operation
kvrocks copied to clipboard

Enhance the implementation of Random Sample Keys in kvrocks

Open mapleFU opened this issue 1 year ago • 10 comments
trafficstars

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Motivation

Random sample keys is supported in kvrocks. When running commands like spop, scanning all data would be a bottleneck. Currently, the implementation is:

  1. Get all sampled values
  2. Random filtering K values
  3. return

Unlike redis, it's complexity is always O(N), N is the element cardinal, which is extremly high. It's a bit hard to get "random" in kvrocks because we're based on rocksdb.

Solution

There're some rules we can apply:

  1. Maintaining a eq-depth histogram if random is frequently. This is the best one if random sample is frequently. However, we'll suffer from maintaining cost during write. TiKV using size based sampling to maintain the split range

So, instead, maybe we can enhance the implementation of current impl. For example:

  1. Counting all "indices" of random values, and sorting them
  2. Using iterator to get only these indices

The complexity is still O(N), but the performance might be enhanced.

Are you willing to submit a PR?

  • [x] I'm willing to submit a PR!

mapleFU avatar Mar 09 '24 14:03 mapleFU

I'm trying to implement this. During my impl, I found a problem that:

  1. Get random set size
  2. Get all indices, counting the random fetching indices
  3. Traverse the set

The command 1-3 should be done in one snapshot, otherwise, the case might be:

  • Get all indices, size == 10
  • Random need to access 10th member
  • User delete an element
  • 10th cannot be fetched

So, GetMetadata and traversing the set should be done in one rocksdb snapshot. Would you think this is safe and ok? @git-hulk @PragmaTwice

mapleFU avatar Mar 11 '24 11:03 mapleFU

I'm wondering if it'd be better to sacrifice the scope of the candidate set to simplify the implementation like below:

  1. if size <= N(e.g. 128/256), then iterate all members and randomly pick one of them.
  2. if size > N, random among the first N elements if the cached random point is empty. Otherwise, seek from the cached random point and then random among the next N elements, then cache the random point again or remove the cache random point if it reaches the end.

git-hulk avatar Mar 11 '24 12:03 git-hulk

if size > N, random among the first N elements if the cached random point is empty. Otherwise, seek from the cached random point and then random among the next N elements, then cache the random point again or remove the cache random point if it reaches the end.

I thought of using this previously, and this is named "statistics" in this issue. but where should we storing this and how do updating this?

mapleFU avatar Mar 11 '24 12:03 mapleFU

but where should we storing this and how do updating this?

We're now using a LRU to cache the iterator key while scanning the DB/hash/set, maybe we can do this in the same way.

git-hulk avatar Mar 11 '24 12:03 git-hulk

That's a bit complex 🤔, in my current impl I may just need to iter the set with one snapshot...

mapleFU avatar Mar 11 '24 12:03 mapleFU

That's a bit complex 🤔, in my current impl I may just need to iter the set with one snapshot...

Yes, it should be more complex than the current implementation. Then I think the implementation with the same snapshot is good.

git-hulk avatar Mar 11 '24 12:03 git-hulk

Yeah, another point is that support get meta by snapshot can enhance our isolation. It might slightly affect LSM-Tree Garbage collection, but I think it's ok

mapleFU avatar Mar 12 '24 07:03 mapleFU

Aha: https://github.com/apache/kvrocks/commit/95301c5a673af83ad85e6dfb63da4db41d83afb8#diff-4bae2c37c513c915c0528134abd47edff2a52833806777feb088b6d09990a74e

@git-hulk should we sample at most 60 keys in spop?

mapleFU avatar Mar 21 '24 13:03 mapleFU

@mapleFU For Kvrocks, I feel good to only randomize part of them if there're too many keys.

git-hulk avatar Mar 21 '24 14:03 git-hulk