[NEW] Valkey-Bloom: BloomFilter support for Valkey.
The problem/use-case that the feature addresses
Bloom filters are a space-efficient probabilistic data structure that can be used to “check” whether an element exists in a set (with a defined false positive rate), and to “add” elements to a set. When checking whether an item exists, false positives are possible, but false negatives are not. https://en.wikipedia.org/wiki/Bloom_filter
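As a rough illustration of the add/check behaviour described above, here is a minimal Python sketch of a Bloom filter. It is not the Valkey-Bloom implementation (which is written in Rust); the class name, the SHA-256 based double hashing, and the parameter names are assumptions made for the example. The sizing uses the standard textbook formulas.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter for illustration only (hypothetical, not Valkey-Bloom)."""

    def __init__(self, capacity: int, error_rate: float) -> None:
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

An item that was added always reports `True` from `might_contain`; an item that was never added usually reports `False`, but may occasionally report `True`, at roughly the configured error rate.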
Description of the feature
Valkey-Bloom is a Rust Valkey module that brings a native, space-efficient probabilistic module data type to Valkey. With it, users can create filters, add elements, perform the “check” operation to test whether an element exists, check cardinality / INFO, auto-scale their filters, reserve filters, perform RDB save and load operations, etc.
Valkey-Bloom is built using bloomfilter::Bloom (https://crates.io/crates/bloomfilter, which has a BSD-2-Clause license).
It is compatible with the BloomFilter (BF.*) command APIs of redislabs/rebloom from Redis Ltd., which has over 10M image pulls on Docker and is compatible with several client libraries.
The following commands are supported.
BF.EXISTS
BF.ADD
BF.MEXISTS
BF.MADD
BF.CARD
BF.RESERVE
BF.INFO
BF.INSERT
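To give a feel for how a few of the commands listed above behave, the sketch below models them in memory in Python. It is only a stand-in: it uses an exact Python set instead of a real Bloom filter (so it never produces false positives), the class and method names are hypothetical, and the reply values (1/0 integers, arrays for the multi-item variants, 0 cardinality for a missing key) follow my understanding of the ReBloom conventions rather than anything confirmed in this proposal.

```python
class ToyBloomServer:
    """Hypothetical in-memory model of a few BF.* commands (exact sets, no false positives)."""

    def __init__(self) -> None:
        self._keys = {}  # key name -> set of items added under that key

    def bf_add(self, key: str, item: str) -> int:
        # Assumed reply: 1 if the item was newly added, 0 if it was already present.
        members = self._keys.setdefault(key, set())
        if item in members:
            return 0
        members.add(item)
        return 1

    def bf_madd(self, key: str, *items: str) -> list:
        # Assumed reply: one BF.ADD-style integer per item, in order.
        return [self.bf_add(key, item) for item in items]

    def bf_exists(self, key: str, item: str) -> int:
        # Assumed reply: 1 if the item may exist, 0 if it definitely does not.
        return int(item in self._keys.get(key, ()))

    def bf_mexists(self, key: str, *items: str) -> list:
        # Assumed reply: one BF.EXISTS-style integer per item, in order.
        return [self.bf_exists(key, item) for item in items]

    def bf_card(self, key: str) -> int:
        # Assumed reply: number of items added; 0 when the key does not exist.
        return len(self._keys.get(key, ()))
```

With a real server, the same calls would be issued as commands such as `BF.ADD key item` through whichever client library is in use.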
We would like to bring Valkey-Bloom into the valkey-io project as an open source Valkey-Module that is free to use, contribute to, etc.
Alternatives you've considered
A bloom filter module does exist today for Redis - https://github.com/goodform/rebloom. However, it uses an AGPL-3.0 license, which has additional obligations that are difficult to meet for many of the active contributors who are looking to provide Valkey as a service. AGPL is also widely disallowed by company open source program offices (including Amazon). Given that this package has not been significantly modified since it was created six years ago, it seems likely that the license is part of the issue.
@KarthikSubbarao we are continuing the goodform.io modules as native valkey modules too. Personally I don't think the lack of activity relates to the license - it's more that the code is essentially done and that all modules generally get little attention once mature - but we're just speculating here.
Can we find a way to co-exist? I have used naming like valkey-bloom (all lower case) and the module shared library valkeybloom.so for a simple transition for users (this module will be in Fedora soon with this naming convention as we transition away from Redis). This matches up with the other goodform.io modules like valkey-search, valkey-json, valkey-graph, and so on.
Would it be possible to name this new module in a way that highlights the differences perhaps? (e.g. Valkey-Bloom-Rust?)
Can we find a way to co-exist?
Given your precedence, I think we shouldn't overwrite your naming. If you want to translate the names to valkey-*, I think we should respect that.
Would it be possible to name this new module in a way that highlights the differences perhaps? (e.g. Valkey-Bloom-Rust?)
We could call it Val-Bloom or something, more similar to how Redis was naming. Or we could name it based on the probability. Based on reading the docs (I've been historically advised not to read AGPL code while working in an AWS capacity), rebloom only supports the Bloom data types and not any of the newer ones supported by Redis (like Top-K or Cuckoo). I don't know how popular any of those are though.
Thanks @KarthikSubbarao for creating this.
This is one of the most popular modules, and I've seen users use various alternatives, like Lua scripts or custom applications built around the BITSET command, when the prior modules weren't accessible (due to licensing). I believe it would be good if the Valkey organization can make it part of the project.
Key questions :
- How do we bundle modules? Should it be part of the binary/containers/release(s) by default?
- Integration tests? Each module having its own testing framework might make maintenance difficult over the years. I would prefer continuing with TCL tests, or introducing a new lightweight framework and using it for each module.
@hpatro there is an existing python-based test framework (BSD licensed) from the early days that has been kept and used with all of the goodform modules. The earlier version is named 'rmtest' (Redis Modules Test) and I've been working on transitioning it to 'vkmtest' (ValKey Modules Test). Maybe it'll work for the Rust module testing too - you can find the initial version here: https://github.com/goodform/valkey-module-test
@natoscott That is something I am very interested in taking over (specifically because I want a python based testing framework for the main project) if you have any interest in offloading the maintenance of it. Ideally it could be re-usable across all projects that run Valkey (or Redis even).
How do we bundle modules? Should it be part of the binary/containers/release(s) by default?
This isn't the question we should answer here. Can you make a separate issue for it?
@madolson happy to either work with you on it or have you take it over - I have a lot on my plate (as I'm sure you do!) but I can definitely still dedicate some time to it. This test framework is also packaged in Fedora and I'd like to upload it to PyPI for ease of use within the Valkey modules too.
@KarthikSubbarao another possibility, if you're super keen on ValkeyBloom and not something with 'Rust' in the name, would be for me to use valkey-module-bloom for the existing modules. In hindsight I see I've used that prefix for -test and -sdk (Python and C respectively), and that convention could perhaps be used on the C modules also? Anyway, let me know your thoughts, I'm happy to change it at this early stage. There was also mention of a new implementation of ValkeyJSON (not sure if it's using Rust) from someone at Alibaba IIRC - so this naming issue may not be an isolated problem.
happy to either work with you on it or have you take it over
Cool! Not an immediate something to figure out, but would love to collaborate on this.
Thanks @KarthikSubbarao for creating this.
This is one of the most popular modules, and I've seen users use various alternatives, like Lua scripts or custom applications built around the BITSET command, when the prior modules weren't accessible (due to licensing). I believe it would be good if the Valkey organization can make it part of the project.
Key questions :
- How do we bundle modules? Should it be part of the binary/containers/release(s) by default?
- Integration tests? Each module having its own testing framework might make maintenance difficult over the years. I would prefer continuing with TCL tests, or introducing a new lightweight framework and using it for each module.
Here we are https://github.com/valkey-io/valkey/issues/408
Would it be possible to name this new module in a way that highlights the differences perhaps? (e.g. Valkey-Bloom-Rust?)
I like a name that highlights the differences in behavior but not one that gives the slightest hint about how it is implemented.
@valkey-io/core-team I guess maybe ask for a vote if we want to adopt this and continue developing it as an official bloom module? This is not committing to a specific date for when we will release it, just to start the ball rolling for a module based distribution.
Some things to consider. There are other modules like the good form modules. I believe Alibaba also has a module that implements bloom that they have not open sourced.
Regarding naming, I thought we had sort-of decided to reserve the valkey prefix for official modules and clients. OTOH, we agreed that the license precondition to become official is that it's open source / free software, which AGPL is, although cloud vendors and other enterprises don't like it. :)
Anyhow, I hope both can co-exist and that they're made API compatible. In that way, users don't need to worry about the differences running on their distro vs running against a hosted database as a service.
@KarthikSubbarao How complete is your module?
I'm fine with adding it, if you (or anyone else) promises to maintain it actively.
My name suggestion is "ValkeyBF", picking up the BF. prefix used in the command names.
@zuiderkwast I think @KarthikSubbarao's bloom filter is licensed under BSD 3-clause and it is the one being proposed here. My vote is yes on the same conditions as Viktor mentioned above: 1) full command compat; 2) active maintenance. Name wise, my preference would be Valkey-Bloom. BF is too short IMO and I would also prefer a dash after Valkey.
@PingXie Do you want to clarify what you mean by "1) full command compat"? I think right now there is not full command compatibility, since only some of the commands are implemented. Do you just mean that the APIs that do exist are compatible?
@zuiderkwast I think @KarthikSubbarao's bloom filter is licensed under BSD 3-clause and it is the one being proposed here.
Yes it is, but we're also discussing the already-existing AGPL "valkey-bloom" module here.
@natoscott It's good that you're willing to name the AGPL module "valkey-module-bloom" and we can name the BSD licensed module "Valkey-Bloom". That's no collision.
Even if we allow projects under the Valkey umbrella to be AGPL, it might be good to avoid it for modules that are to be included in the "Valkey+" (name TBD) package, which will be a container containing Valkey + some official modules.
@PingXie Do you want to clarify what you mean by "1) full command compat"? I think right now there is not full command compatibility, since only some of the commands are implemented. Do you just mean that the APIs that do exist are compatible?
yeah existing commands being fully compatible is good for now. Also the maintainers (whoever they are) agree that eventual full compat (meaning new commands as well) is a p0 goal by default. We can always discuss exceptions on a case-by-case basis. "incremental perfection" (R) :-)
@KarthikSubbarao How complete is your module?
What is done:
- Support for the Bloom Filter Module commands (compatible with the ReBloom Module syntax): BF.ADD, BF.EXISTS, BF.INFO, BF.INSERT, BF.MADD, BF.MEXISTS, BF.RESERVE
- Auto Scaling of Bloom Filters
- RDB save and load for Bloom Filter data types
- Configs for bloom filter expansion rate (used for scaling) and max size of bloom filters (number of elements that can be "added")
- Additional Bloom data type callbacks: Copy command, Free, Memory Usage check, Defrag, Free Effort, etc.
- Initial sanity Memory Usage and Performance tests
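The auto scaling and expansion rate items above can be pictured with a small Python sketch. This shows the general scalable Bloom filter idea, where a full sub-filter is followed by a new one whose capacity is the previous capacity multiplied by the expansion rate. How Valkey-Bloom actually implements scaling is not described in this thread, so the class, its fields, and the exact growth rule here are assumptions.

```python
class ScalingPlan:
    """Hypothetical bookkeeping for a scaling Bloom filter (no real bit arrays)."""

    def __init__(self, initial_capacity: int, expansion: int) -> None:
        self.expansion = expansion
        self.capacities = [initial_capacity]  # capacity of each sub-filter
        self.counts = [0]                     # items recorded in each sub-filter

    def record_add(self) -> None:
        # When the newest sub-filter is full, start a new, larger one.
        if self.counts[-1] >= self.capacities[-1]:
            self.capacities.append(self.capacities[-1] * self.expansion)
            self.counts.append(0)
        self.counts[-1] += 1

    def total_capacity(self) -> int:
        # Sum of the capacities of all sub-filters created so far.
        return sum(self.capacities)
```

For example, with an initial capacity of 100 and an expansion rate of 2, the 101st add creates a second sub-filter of capacity 200, and the 301st add creates a third of capacity 400.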
What is remaining:
- Perf testing to set a baseline. We can decide on a baseline scenario and run tests & document results
- Integration Testing / Unit testing coverage
- Additional Bloom data type callbacks: AOF rewrite and Digest. These are generic Module data type callbacks that can be implemented in the Module.
- Memory-based restrictions - If the memory expected to be allocated by a bloom write operation (such as BF.RESERVE, BF.CREATE) would result in exceeding the allowed memory, then we should reject the command. We need to check whether any additional logic needed to handle this should be added to the Module.
- Additional Bloom specific Module configurations for customizing the created bloom objects & Tuning default/min/max config values.
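As a sketch of what the memory-based restriction in the list above could look like, the Python below estimates a filter's bit-array size from its capacity and error rate using the standard Bloom filter sizing formula, and rejects a request that would push usage over a limit. The function names, the `used_bytes` / `max_bytes` parameters, and the decision rule are hypothetical; a real module would also need to account for per-object overhead and would consult the server's own memory accounting.

```python
import math


def estimated_filter_bytes(capacity: int, error_rate: float) -> int:
    """Estimate bit-array size in bytes via m = -n * ln(p) / (ln 2)^2 (overhead ignored)."""
    num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
    return (num_bits + 7) // 8


def should_reject(capacity: int, error_rate: float, used_bytes: int, max_bytes: int) -> bool:
    """Hypothetical rule: reject when the new filter would exceed the allowed memory."""
    return used_bytes + estimated_filter_bytes(capacity, error_rate) > max_bytes
```

The estimate can be made before any allocation happens, which is what would allow the command to be rejected up front.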
- full command compat;
This Module supports every Bloom Filter command (from ReBloom) except for the BF.LOADCHUNK and BF.SCANDUMP and the commands have been implemented with ReBloom compatibility. The reason for not implementing the two cmds is because the Module provides the ability to load and save BloomModule data type items during RDB load and save. BF.LOADCHUNK and BF.SCANDUMP are APIs to load BloomModule data types through commands, but since we will provide RDB save & load and also AOF Rewrite, having specific commands for the same purpose was not considered as required. This can always be re-evaluated if we think it is useful
- active maintenance.
I would be glad to help with maintenance of the Module by addressing issues and having discussions on missing aspects that we would like to build into the Module's functionality and testing
@zuiderkwast I think @KarthikSubbarao's bloom filter is licensed under BSD 3-clause and it is the one being proposed here. My vote is yes on the same conditions as Viktor mentioned above: 1) full command compat; 2) active maintenance. Name wise, my preference would be Valkey-Bloom. BF is too short IMO and I would also prefer a dash after Valkey.
Full command compat is one of the points I wanted clarification on for all the future modules we're planning to build/accept. As we don't have any data points for Redis Modules, one can't be sure which API(s) were really used. Do you think it's wise to build full compatibility? Right now the changes which @KarthikSubbarao has made support all the bloom filter related API(s) but leave out some of the other probabilistic filter(s). I think we should not strive for full command compatibility to accept a Module. Rather, accept one if it meets the performance/memory/language/coding standards of the project. We can always improve/add as per user requests.
As we don't have any data points for Redis Modules, one can't be sure which API(s) were really used.
We can argue the opposite way too without concrete data and this would become pure speculation at the end.
If there is a legit reason to not be fully compatible we can always take an exception but I think it is important to aim at a higher compat bar so that existing Redis users can migrate their workload seamlessly to Valkey. Any incompatibility adds adoption friction and they add up. I am not saying a module needs to be bit by bit compatible in order to be adopted under Valkey. I am talking about directional alignment on helping all customers move on to Valkey with minimum possible friction.
As we don't have any data points for Redis Modules, one can't be sure which API(s) were really used.
We can argue the opposite way too without concrete data and this would become pure speculation at the end.
If there is a legit reason to not be fully compatible we can always take an exception but I think it is important to aim at a higher compat bar so that existing Redis users can migrate their workload seamlessly to Valkey. Any incompatibility adds adoption friction and they add up. I am not saying a module needs to be bit by bit compatible in order to be adopted under Valkey. I am talking about directional alignment on helping all customers move on to Valkey with minimum possible friction.
Well, the bloom filter module proposed here has all the bloom filter commands implemented. The remaining commands technically don't fit under bloom filter; they would ideally be under probabilistic filter.
@KarthikSubbarao could we also list out the remaining commands not built yet?
This Module supports every Bloom Filter command (from ReBloom) except for the BF.LOADCHUNK and BF.SCANDUMP and the commands have been implemented with ReBloom compatibility. The reason for not implementing the two cmds is because the Module provides the ability to load and save BloomModule data type items during RDB load and save. BF.LOADCHUNK and BF.SCANDUMP are APIs to load BloomModule data types through commands, but since we will provide RDB save & load and also AOF Rewrite, having specific commands for the same purpose was not considered as required.
This got me thinking about the on-disk format compatibility, which would be another very valuable property. Though I can see it being harder to achieve.
This can always be re-evaluated if we think it is useful
I agree.
Along the compat topic, I would also like the module maintainer to provide migration best practices, when applicable.
@KarthikSubbarao could we also list out the remaining commands not built yet?
Realized the other probabilistic filter/algorithm commands are each under a different command namespace: Cuckoo filter commands are under CF.*, count-min sketch commands are under CMS.*, etc.
Ok, so we can eventually have separate modules for cuckoo and minsketch. Seems reasonable.
If there is a legit reason to not be fully compatible we can always take an exception but I think it is important to aim at a higher compat bar so that existing Redis users can migrate their workload seamlessly to Valkey. Any incompatibility adds adoption friction and they add up. I am not saying a module needs to be bit by bit compatible in order to be adopted under Valkey. I am talking about directional alignment on helping all customers move on to Valkey with minimum possible friction.
I think we should start with first principles and decide what we want the APIs to look like, and then decide if we want to be API compatible with Redis. You are starting with the assumption that our users are migrating from Redis, but that need not be the case. They also might be net new developers, and we want to build the right application for them. We may want to alter the APIs to better suit those users.
It should always be evaluated case by case, and should not be a general tenet. I also would bias to skipping APIs that don't make a lot of sense. For example, I know in the search modules they implemented functionality like FT.CONFIG SET, which has largely been replaced with the module config functionality.
This Module supports every Bloom Filter command (from ReBloom) except for the BF.LOADCHUNK and BF.SCANDUMP and the commands have been implemented with ReBloom compatibility. The reason for not implementing the two cmds is because the Module provides the ability to load and save BloomModule data type items during RDB load and save. BF.LOADCHUNK and BF.SCANDUMP are APIs to load BloomModule data types through commands, but since we will provide RDB save & load and also AOF Rewrite, having specific commands for the same purpose was not considered as required.
This got me thinking about the on-disk format compatibility, which would be another very valuable property. Though I can see it being harder to achieve.
This can always be re-evaluated if we think it is useful
I agree.
Along the compat topic, I would also like the module maintainer to provide migration best practices, when applicable.
On the compat topic, we have a lot to deal with because the Redis RDB opcode has changed. I documented the issue here: https://github.com/valkey-io/valkey/issues/645. We don't have a good compatibility story in general with Redis.
@valkey-io/core-team Any TSC interested in helping shape this up? I think this would be a nice module to start with and help set the baseline for other modules in the future.
I'm not particularly interested in spending time with this, but I'm in favor of accepting it, with the relevant bloom filter commands being rebloom-compatible. It's no problem that it excludes non-bloom probabilistic filters (they can be provided by another module in the future) and the obsolete commands (dump/load).
I think this would be a nice module to start with and help set the baseline for other modules in the future.
+1. I am in favor of accepting this module too.
@hpatro I already spoke with Ping and Viktor privately, I will take this module support, Thanks