conduit
Performance: `mapM_` is very inefficient
I have been struggling to understand why conduit's performance results are out of whack in these benchmarks: https://github.com/composewell/streaming-benchmarks. I have root-caused the issue to the use of `mapM_` to discard the stream. I think this usage was picked up from the benchmarks in the machines package, and I retained it. `mapM_` is also used in other packages such as pipes and streaming for the same purpose, so I did not suspect anything wrong with using it, which made the issue all the harder to root-cause.
However, there is a drastic improvement in performance when `sinkNull` is used instead of `mapM_`. `mapM_` is defined like this:
```haskell
mapM_, mapM_C :: Monad m
              => (a -> m ())
              -> ConduitT a o m ()
mapM_C f = awaitForever $ lift . f
```
and is used as `mapM_ (\_ -> return ())` to discard the stream.
`sinkNull` is:

```haskell
sinkNull, sinkNullC :: Monad m => ConduitT i o m ()
sinkNullC = awaitForever $ \_ -> return ()
```
There is not much difference between them except the inline definition of the discarding function. Perhaps the polymorphism in `mapM_` is problematic, or the `lift`, or is it something else?
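To make the comparison concrete, here is a toy, base-only model of a conduit-style sink (my own simplification for illustration; these are not conduit's actual types). The `mapM_` shape introduces one monadic step per element even when the callback is a no-op, whereas the inlined lambda in the `sinkNull` shape leaves nothing to execute per element:

```haskell
{-# LANGUAGE LambdaCase #-}
-- Toy model, not conduit's real implementation: a sink either awaits
-- input, performs an effect in the base monad, or finishes.
module Main where

import Data.IORef

data Sink i m r
  = Await (Maybe i -> Sink i m r)  -- ask upstream for the next value
  | Effect (m (Sink i m r))        -- run an effect in the base monad
  | Done r                         -- finished with a result

-- Feed a list to a sink, interpreting the three constructors.
feed :: Monad m => [i] -> Sink i m r -> m r
feed _      (Done r)   = return r
feed xs     (Effect m) = m >>= feed xs
feed []     (Await k)  = feed [] (k Nothing)
feed (x:xs) (Await k)  = feed xs (k (Just x))

-- mapM_'s shape: one Effect node per element, even when f is a no-op.
mapM_S :: Monad m => (i -> m ()) -> Sink i m ()
mapM_S f = Await $ \case
  Nothing -> Done ()
  Just x  -> Effect (f x >> return (mapM_S f))

-- sinkNull's shape after inlining: no Effect node at all.
sinkNullS :: Monad m => Sink i m ()
sinkNullS = Await $ \case
  Nothing -> Done ()
  Just _  -> sinkNullS

main :: IO ()
main = do
  ref <- newIORef (0 :: Int)
  feed [1 .. 10 :: Int] (mapM_S (\_ -> modifyIORef' ref (+ 1)))
  readIORef ref >>= print            -- 10
  feed [1 .. 10 :: Int] sinkNullS    -- same draining, fewer constructors
```

In this model the per-element `Effect` constructor is exactly the extra work that GHC must optimize away in the real library; when inlining fails, it cannot.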
I think `mapM` has exactly the same problem, which would also explain why the `mapM` benchmark result for conduit is significantly worse than for the other libraries.
Thanks for the report. I'm mostly AFK this week, so I haven't had a chance to look at this. These kinds of optimizations are highly sensitive to GHC inlining rules. Can you clarify exactly which `stack.yaml` file in your repo, and which commit, you're seeing these figures with?
The latest commit and stack.yaml should reproduce this. `./run.sh -- -m pattern conduit` should run all the conduit benchmarks. You can change this file, e.g. replace `sinkNull` with `mapM_`, and run the benchmarks again, this time with the `--append` option: `./run.sh --append -- -m pattern conduit`. This will generate a comparison graph between the baseline and your new changes in the `charts` directory.
Alright, I've looked into this, and have a few comments:

- Conduit is not, and has never been, a low-CPU-overhead framework; it's about I/O-bound tasks. It is well known and understood that Conduit's implementation puts GC pressure on tight inner loops, which is what your test suite is covering.
- Still, speed is nice, so Conduit has a stream fusion framework in it. However, immediately after implementation, it became clear that changes in GHC from version to version make it all but impossible to guarantee firing of this framework. (This point and the previous one are covered in my blog post First class stream fusion.)
- The question here got me curious enough to look into this in more detail. The first thing I discovered is that we can make the rewrite rules fire more reliably, which I've done in b134241f3ab545949d2cf5748de3ef012a0266b9.
- The setup of the benchmark suite makes it much more difficult for the rewrite rules to fire, preventing most of these optimizations. As an example, I've included a diff below that demonstrates a change that gives a 10x speedup (on my machine at least) for the `transformation/map/conduit` test.
- Furthermore, it seems like it's possible to make that a 30x speedup if you bypass the indirection still present and construct a direct pipeline, as the Conduit guides recommend. But this doesn't fit in with the benchmark framework.

In sum: it's nice to get conduit faster via rewrite rules, but this isn't the intended use case of the library, and it can't be relied upon. We know how to get super fast, reliable streaming code (direct stream fusion, as I laid out in my blog post), but the convenience of higher-level libraries wins out in almost all cases, which is why there doesn't seem to be pressure to make that happen.
```diff
diff --git a/Benchmarks/Conduit.hs b/Benchmarks/Conduit.hs
index 5baf369..cda392f 100644
--- a/Benchmarks/Conduit.hs
+++ b/Benchmarks/Conduit.hs
@@ -86,11 +86,13 @@ last = eliminate $ S.last
 {-# INLINE transform #-}
 transform :: Monad m => Pipe m Int Int -> Source m () Int -> m ()
 -- mapM_ is much more costly compared to sinkNull
+transform t src = S.runConduit (src S..| t S..| S.mapM_ (\_ -> return ()))
 --transform t = runStream (t S..| S.mapM_ (\_ -> return ()))
-transform t = runStream (t S..| S.sinkNull)
+--transform t = runStream (t S..| S.sinkNull)
 scan = transform $ S.scanl (+) 0
-map = transform $ S.map (+1)
+--map = transform $ S.map (+1)
+map = \src -> S.runConduit $ src S..| S.map (+1) S..| S.mapM_ (\_ -> return ())
 mapM = transform $ S.mapM return
 filterEven = transform $ S.filter even
 filterAllOut = transform $ S.filter (> maxValue)
diff --git a/stack.yaml b/stack.yaml
index 05d93fb..d0b2af2 100644
--- a/stack.yaml
+++ b/stack.yaml
@@ -18,4 +18,10 @@ extra-deps:
 - lens-4.15.4
 - free-4.12.4
 - drinkery-0.3
+
+- git: https://github.com/snoyberg/conduit
+  commit: b134241f3ab545949d2cf5748de3ef012a0266b9
+  subdirs:
+  - conduit
+
 rebuild-ghc-options: true
```
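The inlining sensitivity discussed above can be illustrated with a generic, base-only GHC example (this is not conduit code; `drainWith` and `drainNull` are hypothetical names standing in for the `mapM_`/`sinkNull` shapes). Whether GHC can specialize a polymorphic loop to its callback often hinges on an `INLINE` pragma at the definition site:

```haskell
-- Generic illustration of INLINE-driven specialization, unrelated to
-- conduit's internals.
module Main where

import Data.IORef

-- Polymorphic driver; without the INLINE pragma GHC may compile one
-- generic copy that invokes its argument through a closure on every
-- element.
{-# INLINE drainWith #-}
drainWith :: Monad m => (a -> m ()) -> [a] -> m ()
drainWith f = go
  where
    go []       = return ()
    go (x : xs) = f x >> go xs

-- "sinkNull"-style: the discarding behavior is baked in, so after
-- simplification there is nothing left to call per element.
drainNull :: Monad m => [a] -> m ()
drainNull []       = return ()
drainNull (_ : xs) = drainNull xs

main :: IO ()
main = do
  ref <- newIORef (0 :: Int)
  drainWith (\_ -> modifyIORef' ref (+ 1)) [1 .. 5 :: Int]
  readIORef ref >>= print        -- 5
  drainNull [1 .. 5 :: Int]      -- same semantics, no per-element action
```

The two functions are observably equivalent when the callback is `\_ -> return ()`; the performance gap comes entirely from what the optimizer is able to eliminate.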
I agree with your observations, and that's why I do not make any recommendations of streaming libraries based on these performance numbers; they are to be used with caution. In fact, I have this caveat in the README to warn against the tendency to misuse micro-benchmarks such as these:

> When choosing a streaming library to use, we should not be overly obsessed with the performance numbers as long as the performance is within reasonable bounds. Whether the absolute performance, or the differences among the various libraries, matters may depend on your workload. If the cost of processing the data is significantly higher than the streaming overhead, the difference may not matter at all; unless you are performing a huge number of tiny operations, the performance difference may not be significant.
The goal of these benchmarks was to measure just the cost of the operations, explicitly discounting fusion, so as not to complicate the comparison and explanation of results. In fact, I designed them such that fusion is avoided, so it is good to know that it's not working. I was also considering designing separate benchmarks to show the fusion capabilities of the libraries, but that's not done yet.
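For context on the fusion mechanism being discussed, GHC rewrite rules look like the following (a made-up, base-only example of the general mechanism; these are not conduit's actual rules). A rule only fires when its left-hand side appears literally in the simplified program, which is why surrounding abstraction, such as a benchmark harness, can silently prevent it:

```haskell
-- Generic illustration of GHC RULES, the mechanism behind conduit's
-- stream fusion; the functions and rule here are invented for the example.
module Main where

-- NOINLINE so that the rewrite rule, not plain inlining, is what can
-- transform call sites.
{-# NOINLINE double #-}
double :: [Int] -> [Int]
double = map (* 2)

{-# NOINLINE incr #-}
incr :: [Int] -> [Int]
incr = map (+ 1)

-- Fuse the two list traversals into one. The rule must be semantics-
-- preserving: the program gives the same answer whether or not it fires.
{-# RULES "double/incr" forall xs. double (incr xs) = map (\x -> (x + 1) * 2) xs #-}

main :: IO ()
main = print (double (incr [1, 2, 3]))  -- [4,6,8] with or without the rule
```

Rules only fire with optimization enabled (`-O` or higher), which is one more way a benchmark setup can end up measuring the unfused code path.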
My main point here was that since `sinkNull` and `mapM_` can be used interchangeably for the same job, a significant performance difference can be introduced inadvertently; the size of the difference depends on the use case, of course. If it is possible to fix it, it would be nice to have that fix. Exactly the same problem is seen with pipes as well; there it is `discard` vs `mapM_`.
I encountered a situation in practice where `someConduit .| awaitForever (\_ -> return ())` performs substantially worse than `someConduit .| sinkNull`. I guess this is related to what you discussed above, since `sinkNull` is implemented under the hood exactly as `awaitForever (\_ -> return ())`, apart from inlining differences.
Have there been any further developments on this matter in the past year?