featurebase
Streaming/Paging support for large bitsets.
Description
If the bitmap returned from a Bitmap, Intersect, Union, etc. query is very large (hundreds of millions of bits), bad things can happen: the server must materialize the entire result in memory before responding, and the client must then hold it all as well. It is possible to stream data back over HTTP without batching it all up first: https://stackoverflow.com/questions/16172524/limit-on-the-length-of-the-data-that-a-webserver-can-return-in-response-to-a-get
We should look into this. We could also try some kind of pagination (what are the pros/cons here?). As an additional optimization we could return roaring-formatted data instead of the array of integers we currently return; this would of course need good client support. I would vote against that option: HTTP already supports compression, and if a client is trying to do something with roaring-formatted data directly, it should probably be doing that work in Pilosa as a plugin.
Success criteria (What criteria will consider this ticket closeable?)
First, decide how to proceed and spec this ticket out in more detail.
The broad success criterion is that extremely large bitmaps can be returned to clients without crashing Pilosa or the client; the whole bitmap, represented as an array of ints, should not need to fit in memory in either place.
(Re-iterating the idea we discussed at the meeting)
We can use the query request's Slices parameter for paging (available via protobuf, which all official client libraries support). If specified, the query runs only on that particular slice, so a single response contains at most one slice's worth of bits. If uniform-sized results are desired, the client libraries can buffer results and run a callback with the desired number of bits.
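The client-side buffering described above could look roughly like the sketch below, which re-chunks variable-sized per-slice results into uniform pages and invokes a callback per page. `pageBits` is an illustrative name, not an actual client-library API.

```go
package main

import "fmt"

// pageBits re-buffers variable-sized per-slice query results into uniform
// pages of pageSize bits, invoking onPage once per full page and once more
// for any final partial page.
func pageBits(sliceResults [][]uint64, pageSize int, onPage func(page []uint64)) {
	buf := make([]uint64, 0, pageSize)
	for _, bits := range sliceResults {
		for _, b := range bits {
			buf = append(buf, b)
			if len(buf) == pageSize {
				onPage(buf)
				buf = make([]uint64, 0, pageSize) // fresh buffer; callback may retain the page
			}
		}
	}
	if len(buf) > 0 {
		onPage(buf)
	}
}

func main() {
	// Simulated results of running the same query against slices 0..2;
	// each slice returns a different number of bits.
	results := [][]uint64{{1, 2, 3}, {1048576, 1048577}, {2097152}}
	pageBits(results, 4, func(page []uint64) {
		fmt.Println(page)
	})
}
```

This keeps the per-request memory bounded by one slice's result plus one page, rather than the full bitmap.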
Specifying slices may help with concurrency too: instead of sending a single query, the client sends many queries, each targeting specific slices, which should increase throughput. Later, this approach can be combined with consistent hashing to target the nodes that hold the specified slices.
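The fan-out idea above can be sketched as follows; `querySlice` is a stand-in for issuing the query over HTTP/protobuf with a single slice specified, and is not a real client call.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// querySlice stands in for sending the query to the cluster restricted to
// one slice; a real client would issue an HTTP/protobuf request here.
func querySlice(slice uint64) []uint64 {
	return []uint64{slice * 100, slice*100 + 1}
}

// queryAllSlices fans the same query out over all slices concurrently and
// merges the per-slice results into one sorted list of bits.
func queryAllSlices(maxSlice uint64) []uint64 {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		all []uint64
	)
	for s := uint64(0); s <= maxSlice; s++ {
		wg.Add(1)
		go func(s uint64) {
			defer wg.Done()
			bits := querySlice(s)
			mu.Lock()
			all = append(all, bits...)
			mu.Unlock()
		}(s)
	}
	wg.Wait()
	sort.Slice(all, func(i, j int) bool { return all[i] < all[j] })
	return all
}

func main() {
	fmt.Println(queryAllSlices(2))
}
```

With consistent hashing added later, each goroutine would send its request directly to the node owning that slice instead of to a single coordinator.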
Was this closed intentionally? If so, is it working by default in the official clients? (very cool) If so, should we (or have we?) made a separate ticket to implement the consistent hash stuff?
This was closed unintentionally when I merged a PR. Reopening.
@jaffee @benbjohnson implemented slices support for clients, but there's no paging/streaming support in the clients beyond that work.
That's about where I thought it was at - thanks @yuce
We're facing the same issues discussed here and think paging would be a solution. Right now, is the new Options query the way to do it?
Adding plugin label as this kind of functionality will become more important when we have algorithms or queries which need to potentially send lots of data between cluster nodes.