node-bloem

stringify method for persistence?

Open loveencounterflow opened this issue 9 years ago • 4 comments

i'm using bloem to quickly test whether a given key may already have been inserted into a database. For this to work properly, i need to persist the state of the bloom filter; right now i'm doing essentially this:

BSON        = ( require 'bson' ).BSONPure.BSON
bloom_bfr   = BSON.serialize old_bloom
# ... write to storage ...
# ... later, read from storage ...
bloom_data  = BSON.deserialize bloom_bfr
# now we have to repair the deserialized data:
for filter in bloom_data[ 'filters' ]
  bitfield              = filter[ 'filter' ][ 'bitfield' ]
  bitfield[ 'buffer' ]  = bitfield[ 'buffer' ][ 'buffer' ]
new_bloom   = BLOEM.ScalingBloem.destringify bloom_data

While i'm taking advantage of bson's ability to efficiently serialize buffers, the solution does suffer from the strange property of bson that it insists on deserializing into a slightly different format from what you gave to it (IOW you don't get round-trip invariance as soon as a buffer is involved. i have no idea what that could be good for).

This seems to work, but it leaves open the question of what the recommended way of persisting a node-bloem filter is. Also, one might add that the destringify method has a confusing name, since it does not accept a string but a suitably prepared JS object.

loveencounterflow avatar Jun 16 '15 16:06 loveencounterflow

would functions that serialize/deserialize a filter object to/from Buffer be useful?

wiedi avatar Jun 16 '15 23:06 wiedi

That is exactly my question (and sorry to be sort of late here). Ideally it would be as simple as using JSON, e.g. bloom_bfr = BLOEM.stringify old_bloom and new_bloom = BLOEM.parse bloom_bfr (modulo another choice of names for those methods, and/or attaching the stringify method to instances, not to the library).

The primary use case of this is of course to allow using a given filter over an existing collection across locations and across process lifetimes, which IMHO is actually the reason to use a Bloom filter at all. If you can't store and re-instantiate a Bloom filter you're pretty much limited to whatever you can do within the lifetime of a single process.

loveencounterflow avatar Jul 10 '15 19:07 loveencounterflow

So currently you can serialize a filter to a JSON string with:

var bloem = require('bloem')
var f = new bloem.Bloem(8, 2)
var persist_this = JSON.stringify(f)

To deserialize use:

var f = bloem.Bloem.destringify(JSON.parse(persisted_thing))
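To illustrate why JSON.parse alone is not enough and a reconstruction step is needed, here is a minimal sketch. ToyFilter and its fromObject method are hypothetical stand-ins written for this example; they are not node-bloem's implementation and say nothing about how destringify works internally. The only library behaviour relied on is that Node's Buffer serializes to { type: 'Buffer', data: [...] } under JSON.stringify.

```javascript
// Hypothetical stand-in for a Bloom filter: a bit array kept in a Buffer.
// This is NOT node-bloem's implementation.
class ToyFilter {
  constructor(sizeInBits, buffer) {
    this.size = sizeInBits
    this.buffer = buffer || Buffer.alloc(Math.ceil(sizeInBits / 8))
  }
  setBit(i) { this.buffer[i >> 3] |= 1 << (i & 7) }
  hasBit(i) { return (this.buffer[i >> 3] & (1 << (i & 7))) !== 0 }

  // Rebuild an instance from the plain object that JSON.parse returns.
  static fromObject(obj) {
    // Buffer#toJSON emits { type: 'Buffer', data: [...] }, so the byte
    // values live under obj.buffer.data.
    return new ToyFilter(obj.size, Buffer.from(obj.buffer.data))
  }
}

const original = new ToyFilter(16)
original.setBit(3)
original.setBit(12)

const text = JSON.stringify(original)       // string, safe to write to storage
const plain = JSON.parse(text)              // plain object: no methods, no Buffer
const restored = ToyFilter.fromObject(plain) // usable filter again

console.log(Buffer.isBuffer(plain.buffer))    // false
console.log(Buffer.isBuffer(restored.buffer)) // true
```

The asymmetry in the snippets above comes from exactly this: stringifying works on any object, but the parsed result has lost its class and its Buffer, so a class-specific step has to put them back.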

I agree that the destringify name is confusing. I am open to better name suggestions.

I also have ideas about a binary format (so serialize to Buffer) but if this is not what you need (because you're happy with JSON) I will hold off on implementing that until I need it.

wiedi avatar Jul 10 '15 20:07 wiedi

So I tested your suggestions and they seem to work. That said, i believe that still leaves open some questions:

  • Whatever way is considered the Right Way to serialize and resurrect a given filter, it is not mentioned in the readme.
  • The interface is unobvious and asymmetric; one has to serialize with JSON but deserialize using a composition of calls to both JSON and a node-bloem method.
  • When deserializing, i have to know exactly which of the three destringify methods i have to call, IOW the serialized data is incomplete in so far as it does not contain type information (granted all serialized data is always up to speculation, but it still would simplify the API if there was a single destringify method on the library that chooses the right implementation class for me).
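The last point could be addressed by storing a type tag next to the data, so that a single entry point can pick the right class. The following is only a sketch of that idea: PlainFilter, ScalingFilter, fromObject, serialize and deserialize are all hypothetical names made up for the example and are not part of node-bloem.

```javascript
// Hypothetical filter classes, used only to demonstrate the dispatch.
class PlainFilter {
  constructor(size) { this.size = size }
  static fromObject(obj) { return new PlainFilter(obj.size) }
}

class ScalingFilter {
  constructor(sizes) { this.sizes = sizes }
  static fromObject(obj) { return new ScalingFilter(obj.sizes) }
}

// Registry mapping the stored type tag to the class that can rebuild it.
const registry = new Map([
  ['PlainFilter', PlainFilter],
  ['ScalingFilter', ScalingFilter]
])

// Wrap the filter in an envelope that records which class it came from.
function serialize(filter) {
  const type = filter.constructor.name
  if (!registry.has(type)) throw new Error('unknown filter type: ' + type)
  return JSON.stringify({ type: type, data: filter })
}

// Single entry point: read the tag, then delegate to the matching class.
function deserialize(text) {
  const envelope = JSON.parse(text)
  const cls = registry.get(envelope.type)
  if (!cls) throw new Error('unknown filter type: ' + envelope.type)
  return cls.fromObject(envelope.data)
}

const restoredPlain = deserialize(serialize(new PlainFilter(8)))
const restoredScaling = deserialize(serialize(new ScalingFilter([8, 16])))
console.log(restoredPlain instanceof PlainFilter)     // true
console.log(restoredScaling instanceof ScalingFilter) // true
```

With an envelope like this the caller no longer needs to remember which class produced the stored data, at the cost of a few extra bytes per serialized filter.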

And yes, the method names are confusing; i'd suggest either BLOEM.stringify and BLOEM.parse (as JSON does) or BLOEM.serialize and BLOEM.deserialize (the more logical choice).

As for the BSON part and the question of 'going binary', i've since thrown out that part already upon learning that JSON suffices. It's just another dependency in the end with some annoying properties and an undocumented API change that made me lose time.

Whether a truly binary format is needed would appear to hinge on the question whether it could be faster and/or smaller than new Buffer JSON.stringify bloom_filter plus whatever optimization (like Gzip or LevelDB's compression) can offer.
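One way to get a feeling for the size side of that question is to measure it. The sketch below uses a made-up 1 KiB bitfield rather than a real node-bloem filter, so the numbers are only indicative; the bit pattern and the { bitfield } wrapper object are assumptions for the example. It relies only on Node's built-in Buffer and zlib modules.

```javascript
const zlib = require('zlib')

// Hypothetical filter state: a 1 KiB bitfield with 200 bytes touched.
// A real filter's fill pattern depends on its size and number of keys.
const bitfield = Buffer.alloc(1024)
for (let i = 0; i < 200; i++) {
  bitfield[(i * 37) % 1024] = (i * 91) % 256
}

// JSON route: Buffer#toJSON turns every byte into a decimal number in an array.
const asJson = Buffer.from(JSON.stringify({ bitfield: bitfield }))
// Binary route: the raw bytes themselves.
const asBinary = bitfield

console.log('JSON bytes:          ', asJson.length)
console.log('raw bytes:           ', asBinary.length)
console.log('gzipped JSON bytes:  ', zlib.gzipSync(asJson).length)
console.log('gzipped raw bytes:   ', zlib.gzipSync(asBinary).length)

// Both routes must give back the same bytes.
const parsed = JSON.parse(asJson.toString())
const roundTripped = Buffer.from(parsed.bitfield.data)
console.log('round trip intact:', roundTripped.equals(bitfield))
```

Uncompressed, the JSON form is necessarily larger, since each byte costs at least one digit plus a comma. How much of that gap compression closes depends on how full the filter is, which is why measuring with real data is more useful than guessing.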

loveencounterflow avatar Jul 11 '15 11:07 loveencounterflow