Feature request: "Multi" prefix extractor support
Say my key format is <account_id>:<user_id>:<some dynamic value>.
Today, we can create a prefix extractor/Bloom filter on <account_id>:<user_id> to help with queries that start with a known <account_id>:<user_id>. However, what we can't do today is ALSO set up a prefix extractor on <account_id>. With both, I could use Bloom filters for queries that know the account id + user id combination as well as for queries that only know the account id. In DB/SQL terminology, this is like being able to create multiple indexes on the "columns" to optimize queries like:
select * from blah where account_id = 123
select * from blah where account_id = 345 and user_id = 678
As far as I know, today we can only have one prefix extractor/bloom per cf so we have the following workarounds which are not ideal:
- Create another cf that duplicates the data, so that one cf has an <account_id>:<user_id> prefix extractor and the other has an <account_id> prefix extractor. Depending on the query and what we already know, we look up the kv from the corresponding cf. The issue here is that we need more disk space to store the duplicate data.
- Given that <account_id> is common to both prefix extractors (in this use case) and we always have it, we use it as the prefix extractor. However, we miss the opportunity to optimize queries that also have <user_id>.
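To make the trade-off concrete, here is a minimal Python sketch of the two single-prefix choices. It does not use any RocksDB API: the helper names (`prefix`, `build_filter`, `files_to_read`), the sample keys, and the two "files" are all hypothetical, and an exact set stands in for a Bloom filter (so there are no false positives). It also assumes the id fields never contain `:`.

```python
def prefix(key, n_fields):
    """Return the first n_fields colon-separated fields of key, joined by ':'."""
    return ":".join(key.split(":")[:n_fields])


def build_filter(keys, n_fields):
    """Stand-in for a per-file Bloom filter: the exact set of prefixes in the file."""
    return {prefix(k, n_fields) for k in keys}


def files_to_read(filters, probe):
    """Indexes of the files whose filter says the probe may be present."""
    return [i for i, f in enumerate(filters) if probe in f]


# Two hypothetical files holding keys of the form <account_id>:<user_id>:<value>.
file_a = ["123:1:x", "123:2:y"]
file_b = ["345:678:z", "345:9:w"]
files = (file_a, file_b)

# Choice 1: a single filter per file on <account_id>.
account_filters = [build_filter(f, 1) for f in files]
# Choice 2: a single filter per file on <account_id>:<user_id>.
account_user_filters = [build_filter(f, 2) for f in files]

# Query for account 345, user 1 (no such user exists in either file).
# The account-only filter cannot rule out file_b, so it is still read.
print(files_to_read(account_filters, "345"))         # [1]
# The account+user filter rules out both files.
print(files_to_read(account_user_filters, "345:1"))  # []

# Query for account 345 only. Probing the account+user filter with "345"
# matches nothing even though file_b holds account 345, so that filter
# cannot be consulted for account-only queries at all.
print(files_to_read(account_user_filters, "345"))    # []
```

With only one of the two filters available, one of the two query shapes always loses: either account+user lookups read files they could have skipped, or account-only lookups cannot use the filter.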
Looks like something similar was requested: https://groups.google.com/g/rocksdb/c/bb6Db8Y3xwU
@ajkr What do you think about a feature like this? It seems very useful/high impact, but I'm not sure what the level of effort is.
Can @pdillinger's key segment filtering (https://github.com/facebook/rocksdb/blob/110ce5f4a392d02167cee3439160f83d2929a2c8/include/rocksdb/experimental.h#L64-L163, #12075) be used for this purpose?
Oh interesting, I didn't know this existed. I'll take a closer look at how it works. Is this being used in production anywhere right now? Any gotchas?
So reading this:
To simplify satisfying some filtering requirements, the segments must encompass a complete key prefix (or the whole key) and segments cannot overlap.
Specifically, the "segments cannot overlap" part means this won't work for my use case (unless I'm misunderstanding). To use the terminology here: given a key of the form <account_id>:<user_id>:<some dynamic value>, I would like to create two segments for filtering, <account_id> & <account_id>:<user_id>. Since both segments share the <account_id> part, the two segments are "overlapping" and therefore not allowed right now?
Or... actually, maybe the whole point is to use the category concept? So I can have one category that contains two segments, <account_id> & <user_id>. Then I can filter by "category" to satisfy queries like select * from blah where account_id = 345 and user_id = 678, or filter by "segment" (specifically the account_id segment) to satisfy queries like select * from blah where account_id = 123?
Also, per https://github.com/facebook/rocksdb/blob/v9.3.1/include/rocksdb/experimental.h#L334-L335, how does the filter being used here compare to Bloom/ribbon performance-wise? Are there any benchmarks?
I think this is exactly what I need, but I would love more examples, and I will likely wait until Bloom/ribbon filters are supported.
The API and functionality is not yet complete for the filtering you want, but the KeySegmentsExtractor API is intended to be complete.
"Specifically, the segments cannot overlap part means this won't work for my use case (unless I'm misunderstanding)"
You want a segment for each field in your key. This should be stable regardless of your desired filtering strategy (except when you extend or replace your key schema). You want a Bloom/ribbon filter on SelectKeySegment(0) and a Bloom/ribbon filter on SelectKeySegmentRange(0,1). Creating Bloom/ribbon filters is not yet available in the API:
https://github.com/facebook/rocksdb/blob/9.2.fb/include/rocksdb/experimental.h#L334-L335
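As a rough illustration of the idea in the reply above (one non-overlapping segment per key field, with filters built over selections of those segments), here is a small Python model. It is only a sketch and does not call RocksDB: `split_segments`, `select_segment`, and `select_segment_range` are hypothetical stand-ins, the range is assumed to be inclusive to match how (0,1) is used above for account + user, and plain sets stand in for the Bloom/ribbon filters that are not yet available.

```python
def split_segments(key):
    """Split <account_id>:<user_id>:<value> into three non-overlapping segments.

    maxsplit=2 keeps any ':' inside the dynamic value in the last segment.
    """
    return key.split(":", 2)


def select_segment(key, index):
    """Filter input made of a single segment (hypothetical helper)."""
    return split_segments(key)[index]


def select_segment_range(key, first, last):
    """Filter input made of segments first..last, inclusive (hypothetical helper)."""
    return ":".join(split_segments(key)[first:last + 1])


keys = ["123:1:x", "345:678:a:b"]

# One filter keyed on segment 0 (account id only) ...
by_account = {select_segment(k, 0) for k in keys}
# ... and one keyed on segments 0..1 (account id + user id).
by_account_user = {select_segment_range(k, 0, 1) for k in keys}

print(sorted(by_account))       # ['123', '345']
print(sorted(by_account_user))  # ['123:1', '345:678']
```

The segments themselves never overlap; the two filters differ only in which segments they select, which is how both query shapes can be covered from a single key schema.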
Got it, thanks for confirming! It's great that what I'm looking for is being worked on. Is there an existing issue tracking the rest of this work that I can follow, or should I just keep this issue open?
@pdillinger sorry, I just wanted to confirm, since I might have made an incorrect assumption. Based on what you said ("The API and functionality is not yet complete for the filtering you want"), is it safe to assume the filtering I want is planned, or is it not currently planned/prioritized?
I am also very much interested in this feature as I have a similar use case.