kvrocks
Enhance the implementation of Random Sample Keys in kvrocks
Search before asking
- [X] I had searched in the issues and found no similar issues.
Motivation
Random sample keys are supported in kvrocks. When running commands like `SPOP`, scanning all the data can become a bottleneck. Currently, the implementation is:
- Get all sampled values
- Randomly filter K values
- Return them

Unlike Redis, its complexity is always O(N), where N is the cardinality of the set, which is extremely high. It's a bit hard to get "random" in kvrocks because we're built on RocksDB.
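The steps above can be sketched like this (a toy Python model, not the actual kvrocks C++ code; `members` stands in for a full scan of the set's column family):

```python
import random

def spop_current(members, k):
    """Sketch of the current approach: scan every member first (O(N)),
    then randomly keep k of them."""
    scanned = list(members)           # full scan of all members
    k = min(k, len(scanned))          # clamp k to the set cardinality
    return random.sample(scanned, k)  # random filtering of k values
```

Even when k is tiny, the full scan dominates, which is the bottleneck described above.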
Solution
There are some rules we can apply:
- Maintain an equi-depth histogram if random sampling is frequent. This is the best option in that case; however, we'd pay a maintenance cost on every write. TiKV uses size-based sampling to maintain its split ranges.
So, instead, maybe we can enhance the current implementation. For example:
- Generate all the random "indices" to fetch, and sort them
- Use an iterator to fetch only those indices

The complexity is still O(N), but the performance should improve.
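A minimal sketch of that idea (an assumption about the shape of the improvement, not the actual patch): draw the k indices up front, sort them, and collect them in a single forward pass, stopping as soon as the largest index has been fetched.

```python
import random

def spop_sorted_indices(member_iter, size, k):
    """Pick k random members in one forward pass.

    member_iter: a forward iterator over the set's members
    size: the set cardinality (from metadata)
    """
    k = min(k, size)
    # Sorted random indices let us consume the iterator strictly forward.
    targets = sorted(random.sample(range(size), k))
    picked, t = [], 0
    for idx, member in enumerate(member_iter):
        if t == len(targets):
            break                     # all requested indices collected; stop early
        if idx == targets[t]:
            picked.append(member)
            t += 1
    return picked
```

Worst case is still O(N), but the scan now ends at the largest sampled index instead of always walking the whole set.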
Are you willing to submit a PR?
- [x] I'm willing to submit a PR!
I'm trying to implement this. During my implementation, I found a problem. The steps are:
1. Get the random set size
2. Generate all the random fetching indices
3. Traverse the set
Steps 1-3 must be done in one snapshot; otherwise, the following can happen:
- Get all indices, size == 10
- The random pick needs to access the 10th member
- A user deletes an element
- The 10th member can no longer be fetched

So, GetMetadata and traversing the set should be done in one RocksDB snapshot. Do you think this is safe and OK? @git-hulk @PragmaTwice
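The race above can be modeled with a toy sketch (a Python list stands in for the column family, a frozen tuple stands in for a RocksDB snapshot; purely illustrative):

```python
def fetch_by_index(view, idx):
    """Return the idx-th member of an ordered view, or None if out of range."""
    return view[idx] if idx < len(view) else None

# Without a snapshot: the size is read first, a concurrent delete shrinks
# the set, and the "10th member" (index 9) no longer exists.
live = [f"m{i}" for i in range(10)]
size = len(live)                     # step 1: GetMetadata sees size == 10
live.pop()                           # concurrent delete of one member
assert fetch_by_index(live, size - 1) is None   # step 3 fails

# With a snapshot: metadata and traversal read the same frozen view,
# so the concurrent delete cannot invalidate the sampled indices.
live = [f"m{i}" for i in range(10)]
snapshot = tuple(live)               # stands in for a RocksDB snapshot
size = len(snapshot)
live.pop()                           # delete does not affect the snapshot
assert fetch_by_index(snapshot, size - 1) == "m9"
```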
I'm wondering if it'd be better to sacrifice the scope of the candidate set to simplify the implementation, like below:
- if size <= N (e.g. 128/256), then iterate all members and randomly pick one of them.
- if size > N, pick randomly among the first N elements if the cached random point is empty. Otherwise, seek from the cached random point and pick randomly among the next N elements, then cache the random point again, or remove the cached random point if it reaches the end.
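A sketch of that cached-random-point idea (an assumption about the scheme, with a plain list standing in for the iterator and an integer cursor standing in for the cached seek key):

```python
import random

def sample_with_cursor(members, n, cursor):
    """Pick one member from a window of at most n members starting at the
    cached cursor; return the pick and the new cursor (None at the end)."""
    window = members[cursor:cursor + n]  # seek to the cursor, read up to n
    pick = random.choice(window)
    next_cursor = cursor + n
    if next_cursor >= len(members):
        next_cursor = None               # reached the end: drop the cache
    return pick, next_cursor
```

Each call only touches N elements, so the cost per sample is bounded; the trade-off is that one call can only return a member from the current window, not the whole set.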
I thought of using this previously; it's what I called "statistics" in this issue. But where should we store this, and how do we update it?
We're now using an LRU to cache the iterator key while scanning the DB/hash/set; maybe we can do this in the same way.
That's a bit complex 🤔, in my current implementation I may just need to iterate the set with one snapshot...
Yes, it should be more complex than the current implementation. Then I think the implementation with the same snapshot is good.
Yeah, another point is that supporting metadata reads via snapshot can enhance our isolation. It might slightly affect LSM-tree garbage collection, but I think it's OK.
Aha: https://github.com/apache/kvrocks/commit/95301c5a673af83ad85e6dfb63da4db41d83afb8#diff-4bae2c37c513c915c0528134abd47edff2a52833806777feb088b6d09990a74e
@git-hulk should we sample at most 60 keys in `SPOP`?
@mapleFU For kvrocks, I think it's fine to only randomize part of them if there are too many keys.