
Streaming/Paging support for large bitsets.

Open jaffee opened this issue 7 years ago • 7 comments

Description

If the bitmap returned from a Bitmap, Intersect, Union, etc. query is very large (hundreds of millions of bits), bad things can happen: materializing the whole result before sending it can exhaust memory on the server, the client, or both. It is possible to just stream data back over HTTP without batching it all up first; see https://stackoverflow.com/questions/16172524/limit-on-the-length-of-the-data-that-a-webserver-can-return-in-response-to-a-get

We should look into this. We could also try some kind of pagination (what are the pros/cons here?). As an additional optimization we could return the roaring-formatted data instead of the array of integers we currently return - this would of course need good client support. I think I would vote against this option, as HTTP already supports compression, and if a client is trying to do something with the roaring-formatted data, it should probably be doing it in Pilosa as a plugin.

Success criteria (What criteria will consider this ticket closeable?)

First, make a decision about how to proceed and spec this ticket out in more detail.

The broad success criterion is that extremely large bitmaps can be returned to clients without crashing Pilosa or the client (the whole bitmap represented as an array of ints should not need to fit in memory in either place).

jaffee avatar Jun 14 '17 15:06 jaffee

(Re-iterating the idea we discussed at the meeting)

We can use the query request's Slices parameter for paging (available via protobuf, which all official client libraries support). If specified, the query would run only on that particular slice, which would cap a response at one slice's worth of bits. If uniformly sized results are desired, the client libraries can buffer results and run a callback with the desired number of bits.

Specifying slices may help with concurrency too. Instead of sending a single query, the client would send many queries, each targeting specific slices, which should increase throughput. Later, this approach could be combined with consistent hashing to target the nodes which hold the specified slices.

yuce avatar Jun 16 '17 15:06 yuce

Was this closed intentionally? If so, is it working by default in the official clients? (very cool) If so, should we (or have we?) made a separate ticket to implement the consistent hash stuff?

jaffee avatar Jan 17 '18 17:01 jaffee

This was closed unintentionally when I merged a PR. Reopening.

yuce avatar Jan 17 '18 18:01 yuce

@jaffee @benbjohnson implemented slices support for clients, but there's no paging/streaming support in the clients beyond that work.

yuce avatar Jan 17 '18 18:01 yuce

That's about where I thought it was at - thanks @yuce

jaffee avatar Jan 17 '18 18:01 jaffee

I'm facing the same issues discussed here and think paging would be a solution.

Right now, is the new Options query the way to do it?

dmibor avatar Jan 31 '19 06:01 dmibor

Adding plugin label as this kind of functionality will become more important when we have algorithms or queries which need to potentially send lots of data between cluster nodes.

jaffee avatar Apr 30 '19 15:04 jaffee