groundhog icon indicating copy to clipboard operation
groundhog copied to clipboard

Streaming interface for queries?

Open hvr opened this issue 10 years ago • 8 comments

Currently, I use select as in e.g.

            coms <- runDbConn (select ((ComBoardField ==. BoardIdKey bid) &&.
                                       (ComArtNumField >=. lo') &&.
                                       (ComArtNumField <=. hi'))) cm

to fetch up to about 20000 rows from a table. However, this seems to result in 60MiB of ARR_WORDS being allocated according to Heap profiling, which is far more than the underlying (Sqlite) database contains actual data.

Is there a way to consume result values as soon as they get returned, rather than all at once? Maybe e.g. via a fold-like API?

hvr avatar Jul 01 '15 16:07 hvr

The size of data loaded in memory may be 2-4 times bigger than in the database because of boxed representation in Haskell. Adding a streaming interface involves two issues:

  1. Where to put the streaming functions separating them from functions that return lists. This can be done by adding selectStream, projectStream to PersistBackend. Alternatively, they may be in another class or module. I am open to the name suggestions.
  2. Choosing the interface. There are two popular streaming libraries: Pipes and Conduit and they both have solid infrastructure. Currently Groundhog has a type RowPopper that existed in Persistent prior to conduit interface. The function selectStream can use either it or Source/Producer from Conduit/Pipes respectively. In the first case we can have libraries like groundhog-pipes that transform RowPopper to the actual type. In the second case Groundhog will be tied into one of the libraries.

lykahb avatar Jul 02 '15 16:07 lykahb

The size of data loaded in memory may be 2-4 times bigger than in the database because of boxed representation in Haskell.

Yeah, but I'm surprised that the majority of heap objects are ARR_WORDS rather than (boxed) constructors...

I've got no strong opinion about the API. At the end of the day I just need a fold-style (and/or foldM-style) API, as I work mostly with Builders, and sometimes I want to interface with simple low-overhead ByteString.Builder -> IO ()-style sinks, for which I deliberately avoid the use of any high-level streaming framework. So I guess if I can "dumb down" a pipes tuple-source to a trivial fold-style API w/o paying too much overhead, I'll be happy :-)

hvr avatar Jul 03 '15 07:07 hvr

@lykahb

I'm currently writing a RethinkDB backend and am particularly interested in this issue. Having a Streaming instance opens up change subscriptions in Rethink and tailable cursors in MongoDB (Although this would probably be another class StreamingSubscribe or something).

I believe the Cursor only needs to be implemented in terms of two functions close (this is debatable since cursors should be automatically close, but could be handy for the subscriptions). and next which returns (Just Entity). From that we can build mapping/folding implementations. However this requires a cursor object which sql-simple doesn't expose.

We could implement it in terms of a fold (which I think all sql-simple packages already have implemented). PersistEntity v -> m () would be in the spirit of the library but it could also be a typefamily associated with Cursor a, CursorRow a, and the fold could be CursorRow a -> m () and make CursorRow a able to convert to Entity v, which would open up two functions foldRows and foldEntityRows.

I don't think groundhog needs to be tied to any other iterator libraries. Although a clean implementation would leave open the possibility of a lightweight groundhog-conduit down the road.

EDIT: Actually a conduit library would be a pretty cool combination with websockets or other network that would let you publish db changes or events easily. With a StreamingSubscribe it could resemble something like meteor.

hansonkd avatar Sep 02 '15 01:09 hansonkd

I have just committed a streaming interface. It was written a while ago and was going to push it after writing at least either groundhog-pipes or groundhog-conduit. It has a bracket-like function interface that passes next and closes automatically. Please let me know what you think about ca2e901ae38774040721a0b85f974cab9e301130

lykahb avatar Sep 02 '15 02:09 lykahb

@lykahb What's the ETA for groundhog-pipes btw?

hvr avatar Sep 02 '15 11:09 hvr

This looks like it should work! Thanks.

When I finish up my RethinkDB backend I'll take a look at implement groundhog-conduit

hansonkd avatar Sep 02 '15 23:09 hansonkd

Sorry, @hvr. I've just noticed your comment. I got back to groundhog-pipes yesterday. It is harder than I thought to use bracket-like function in there. Perhaps, as @hansonkd suggested, I will use next and close.

lykahb avatar Sep 04 '15 20:09 lykahb

I've committed groundhog-pipes. After logging is fixed, the packages will be ready for release!

lykahb avatar Sep 12 '15 02:09 lykahb