Better streaming support
At the moment, for decoding streams there is an incremental function and a conduit, and for encoding there is only a conduit.
It would be better to provide a good generic streaming API, perhaps using a CPS approach as in `streaming-commons`, and then split off separate packages `store-conduit`, `store-pipes`, etc. That would also cut the dependencies down significantly.
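As a sketch of what such a generic, CPS-style decoding API could look like (every name here is hypothetical, invented for illustration; only the shape is the point): a decode result either finishes with leftover input, asks for more bytes, or fails, and each streaming package would then only need a small driver loop around it.

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)

-- Hypothetical CPS-style decode result, in the spirit of attoparsec's
-- Done/Partial/Fail; not part of store's actual API.
data DecodeResult a
  = Done a ByteString                             -- value plus unconsumed input
  | Partial (Maybe ByteString -> DecodeResult a)  -- Nothing signals end of input
  | Fail String

-- Toy decoder that waits for at least n bytes, standing in for a real
-- message decoder.
bytes :: Int -> DecodeResult ByteString
bytes n = go B.empty
  where
    go acc
      | B.length acc >= n =
          let (h, t) = B.splitAt n acc in Done h t
      | otherwise = Partial $ \minput -> case minput of
          Nothing    -> Fail "unexpected end of input"
          Just chunk -> go (B.append acc chunk)

-- Generic driver over a list of chunks; any streaming library could
-- supply the chunks instead.
run :: DecodeResult a -> [ByteString] -> Either String (a, ByteString)
run (Done x rest) _        = Right (x, rest)
run (Fail e)      _        = Left e
run (Partial k)   []       = run (k Nothing) []
run (Partial k)   (c : cs) = run (k (Just c)) cs
```

A conduit, pipes, or io-streams adapter then reduces to a loop that feeds that library's chunks into the `Partial` continuation.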
I will write a `pipes-store` if you adopt this model. I think it would be clearer this way. Ideally, the module would be a `streaming-commons-store`. If that approach were typical, I think the structure of these things would become a little clearer to people. In fact, `conduitEncode` just maps `encodeMessage` over the stream, so it is the one function `conduitEncode` alone that really requires the dependencies.
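To illustrate that observation with plain lists (the `encodeMessage` below is a stand-in using a toy one-byte length prefix, not store's actual framing): once per-message encoding is a pure function, the streaming encoder for any library is just that library's `map`.

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)

-- Stand-in for store's per-message framing: a one-byte length prefix.
-- (Real store framing differs; this only shows the shape. Messages are
-- assumed shorter than 256 bytes here.)
encodeMessage :: ByteString -> ByteString
encodeMessage bs = B.cons (fromIntegral (B.length bs)) bs

-- The streaming encoder is just a map: with conduit it would be
-- Data.Conduit.List.map, with pipes Pipes.Prelude.map; here, list map.
streamEncode :: [ByteString] -> [ByteString]
streamEncode = map encodeMessage
```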
On the other hand, in defense of the current arrangement, `conduit` is in fact a quite lightweight dependency, and if I count right, the two operations that require `conduit` just need `resourcet`, `mmorph`, and maybe `exceptions`. The other things `conduit` needs, like `monad-control`, are already being used. (Similarly, adding two pipes operations on top of that would just add `pipes`.)
The streaming support is already `conduit`-independent, see https://github.com/fpco/store/blob/master/src/Data/Store/Streaming.hs . CC @kantp
Actually, I think you're proposing to change the streaming interface. Reopening.
@ygale I think you're proposing two things:

1. Split the conduit support into a separate package
2. Make the conduit-independent streaming API different
As for 1, I prefer fewer packages over split packages, but I think @mgsloan and @snoyberg are the best people to decide.
As for 2, could you give a concrete API closer to what you'd like to see? Presumably something not involving the `ByteBuffer` but without `conduit` (e.g. something eating `ByteString`s)?
If there's enough demand, we could split it out. I'm usually in favor of fewer packages (less hassle), but `store` is useful for many, many applications, so it may well be worth it in this case to reduce dependencies if they cause people not to use it.
1. Yes. Although as a conduit user I personally do not need the package split, I see this as potentially becoming an important infrastructure package across the entire Haskell ecosystem. I think it would serve that role better without the conduit dependency in the core package. But I realize that by suggesting this, I am volunteering other people to do more work. So it is just a suggestion.
2. I'm not sure if the streaming interface needs to be changed or not; perhaps it just needs more documentation. Let's say I have a data type where one of the fields is a huge chunk of `Text` that I do not want in memory all at once. In my data type it might be encoded as a conduit, or a pipe, or a lazy `Text`, or an `IO (Maybe Text)` to get the next chunk, or some other roll-your-own enumerator. How would I serialize that with the existing API?
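One way this could work without holding the whole field in memory (a hedged sketch; `frame` and `encodeChunked` are invented names, not store's API, and chunks are assumed non-empty so that an empty message can serve as the end marker) is to pull chunks from an `IO (Maybe chunk)` source and frame each chunk as its own message:

```haskell
import Data.IORef
import qualified Data.ByteString as B
import Data.ByteString (ByteString)

-- Stand-in for a real per-message encoder (store's encodeMessage would
-- play this role): a toy one-byte length prefix.
frame :: ByteString -> ByteString
frame bs = B.cons (fromIntegral (B.length bs)) bs

-- Encode a huge field as a sequence of chunk messages followed by an
-- empty terminator message, so only one chunk is resident at a time.
encodeChunked :: IO (Maybe ByteString) -> (ByteString -> IO ()) -> IO ()
encodeChunked next emit = loop
  where
    loop = do
      mb <- next
      case mb of
        Just chunk -> emit (frame chunk) >> loop
        Nothing    -> emit (frame B.empty)  -- end-of-stream marker

-- An IO (Maybe chunk) source backed by an IORef, standing in for
-- "give me the next chunk of a huge Text".
fromList :: [ByteString] -> IO (IO (Maybe ByteString))
fromList xs = do
  ref <- newIORef xs
  pure $ atomicModifyIORef' ref $ \ys -> case ys of
    []     -> ([], Nothing)
    z : zs -> (zs, Just z)
```

A conduit, pipe, or lazy-`Text` chunk source would slot in wherever `fromList` is used; the decoder side would read messages until it sees the empty terminator.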
I think such a split should be deferred until it's clear how we want to organize things + there is sufficient demand.
Here are provisional implementations of decoding and encoding operations for some other streaming libraries - io-streams, pipes and streaming, in one omnibus module https://gist.github.com/michaelt/d937472ec758e8d56bac855d9a2ba534 I'm still figuring out some features, so there's probably some bug somewhere. The pipes and streaming implementation of the decoding function could be streamlined, but an attempt I made just now introduced an obvious bug with the little test I was using, which fragments the byte stream into singleton Word8s before deserializing it.
The pipes and streaming implementations could not use the exported API, so a bit of the internal material is copied into the module. Maybe this other material should be exported, under some title like "kit for implementing support for other streaming libraries" or something. The exported API is perfect for io-streams.
The results from primitive benchmarking were as follows:

```
# deserialize, map ord and sum
$ time ./store_stream_bench ci   real 0m1.312s   -- conduit
$ time ./store_stream_bench si   real 0m0.936s   -- streaming
$ time ./store_stream_bench pi   real 0m6.430s   -- pipes
$ time ./store_stream_bench ii   real 0m0.109s   -- io-streams
# serialize long stream of chars
$ time ./store_stream_bench co   real 0m4.426s   -- conduit
$ time ./store_stream_bench so   real 0m0.975s   -- streaming
$ time ./store_stream_bench po   real 0m0.943s   -- pipes
$ time ./store_stream_bench io   real 0m0.955s   -- io-streams
```
I was just using this to check different implementations while I wrote them; it can't be taken seriously. Maybe I will criterionize it.
Wow, good stuff! Cool to see code integrating `store` with more libraries! It'll also be interesting to see where the performance difference with serializing via conduit comes from.
I criterionized the benchmarks, for what it's worth: https://gist.github.com/michaelt/d937472ec758e8d56bac855d9a2ba534#file-store_stream_bench-hs It needs a somewhat different structure, but the odd slowness of conduit deserializing seems to hold up:
```
$ ./store_stream_bench_c
benchmarking serialize/iostreams
time                 131.1 μs   (129.3 μs .. 132.8 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 131.1 μs   (129.8 μs .. 132.8 μs)
std dev              4.963 μs   (3.851 μs .. 6.270 μs)
variance introduced by outliers: 37% (moderately inflated)

benchmarking serialize/streaming
time                 343.0 μs   (339.8 μs .. 347.4 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 354.6 μs   (350.9 μs .. 360.6 μs)
std dev              15.50 μs   (12.41 μs .. 22.86 μs)
variance introduced by outliers: 40% (moderately inflated)

benchmarking serialize/pipes
time                 2.202 ms   (2.178 ms .. 2.221 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 2.182 ms   (2.167 ms .. 2.196 ms)
std dev              51.25 μs   (41.99 μs .. 69.38 μs)
variance introduced by outliers: 10% (moderately inflated)

benchmarking serialize/conduit
time                 1.425 ms   (1.411 ms .. 1.437 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.439 ms   (1.426 ms .. 1.453 ms)
std dev              46.49 μs   (35.57 μs .. 59.65 μs)
variance introduced by outliers: 19% (moderately inflated)

benchmarking deserialize/iostreams
time                 521.9 μs   (508.0 μs .. 535.9 μs)
                     0.991 R²   (0.985 R² .. 0.995 R²)
mean                 523.4 μs   (506.7 μs .. 548.2 μs)
std dev              70.68 μs   (47.35 μs .. 122.2 μs)
variance introduced by outliers: 86% (severely inflated)

benchmarking deserialize/streaming
time                 443.2 μs   (420.3 μs .. 472.7 μs)
                     0.975 R²   (0.957 R² .. 0.990 R²)
mean                 452.9 μs   (431.4 μs .. 482.4 μs)
std dev              79.41 μs   (63.40 μs .. 105.0 μs)
variance introduced by outliers: 91% (severely inflated)

benchmarking deserialize/pipes
time                 626.2 μs   (600.5 μs .. 661.8 μs)
                     0.981 R²   (0.965 R² .. 0.994 R²)
mean                 637.0 μs   (611.8 μs .. 682.9 μs)
std dev              120.7 μs   (79.81 μs .. 209.3 μs)
variance introduced by outliers: 92% (severely inflated)

benchmarking deserialize/conduit
time                 4.759 ms   (4.668 ms .. 4.845 ms)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 4.896 ms   (4.828 ms .. 4.987 ms)
std dev              244.7 μs   (196.6 μs .. 356.9 μs)
variance introduced by outliers: 30% (moderately inflated)
```