vector
vector copied to clipboard
unstreamM etc destroys performance; expose unstreamPrimM & document most-performant method
I was messing around with squeezing performance out of this library.
-- utility function used in the rest of the code
fillOrderPart :: (Monad m, Ord a, Num a) => a -> StateT a m a
fillOrderPart c = state $ \r -> let x = min c r in (c - x, r - x)
fillOrder1V :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1V = runState . Data.Vector.Generic.mapM fillOrderPart
{-
benchmarking static/vector-unboxed/fillOrder1
time 53.16 μs (53.02 μs .. 53.31 μs)
benchmarking static/vector-storable/fillOrder1
time 52.89 μs (52.59 μs .. 53.22 μs)
-}
fillOrder1VM :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VM book order = runST $ flip runStateT order $ do
book' <- VG.thaw book
VGM.mapM_ fillOrderPart book'
VG.unsafeFreeze book'
{-
benchmarking static/vector-unboxed-copy-mut/fillOrder1
time 11.20 μs (11.19 μs .. 11.21 μs)
benchmarking static/vector-storable-copy-mut/fillOrder1
time 11.51 μs (11.46 μs .. 11.55 μs)
-}
fillOrder1VB2 :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VB2 book order =
-- FIXME: unstreamM performs horribly to list construction, sadly
runState (VG.unstreamM $ VFB.mapM fillOrderPart (VG.stream book)) order
{-
benchmarking static/vector-unboxed-bundle-unstreamM/fillOrder1
time 26.30 μs (26.24 μs .. 26.35 μs)
benchmarking static/vector-storable-bundle-unstreamM/fillOrder1
time 46.10 μs (45.85 μs .. 46.26 μs)
-}
-- bug in vector; defined but not exported. we re-defined it for our use here
unstreamPrimM :: (PrimMonad m, VG.Vector v a) => VFB.MBundle m u a -> m (v a)
{-# INLINE [1] unstreamPrimM #-}
unstreamPrimM s = VGM.munstream s >>= VG.unsafeFreeze
fillOrder1VB :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VB book order =
runST $ flip runStateT order $ unstreamPrimM $ VFB.mapM fillOrderPart $ VG.stream book
{-
benchmarking static/vector-unboxed-bundle/fillOrder1
time 1.426 μs (1.421 μs .. 1.432 μs)
benchmarking static/vector-storable-bundle/fillOrder1
time 855.9 ns (852.2 ns .. 859.3 ns)
-}
By contrast, here is the equivalent rust code:
fn fill_order(book: &mut Vec<u64>, order: u64) -> u64 {
let mut r = order;
for c in book.iter_mut() {
let x = r.min(*c);
r -= x;
*c -= x;
}
r
}
// fill_order time: [868.87 ns 870.21 ns 871.76 ns]
As you can see, the properly-written fusion version performs as well as rust. This was a very pleasant surprise for me. HOWEVER - the correct way is not documented anywhere!!! In particular, the convenience function unstreamPrimM
is for some reason defined in the source code of Vector.Generic
, used no-where else, not exported nor advertised, yet is absolutely vital for reaching this nirvana of performance.
By contrast, unstreamM is what's exposed in the API and destroys the performance so it performs even worse than the manual imperative mutable version. Even worse, all the utility functions are written in terms of unstreamM, e.g. VG.mapM, etc etc. Yes this means they have a convenient Monad m =>
constraint, but any non-haskell-expert that cares about performance would benchmark it and write off the library as "Haskell is slow". Providing mirror utilities that have a PrimMonad m =>
constraint that use unstreamPrimM
instead of unstreamM
, as well as a few examples, would help this effort.
For reference, the above took me about half a day. Not everybody exploring Haskell has that sort of time or patience.