vector icon indicating copy to clipboard operation
vector copied to clipboard

unstreamM etc destroys performance; expose unstreamPrimM & document most-performant method

Open infinity0 opened this issue 2 years ago • 7 comments

I was messing around with squeezing performance out of this library.

-- utility function used in the rest of the code
fillOrderPart :: (Monad m, Ord a, Num a) => a -> StateT a m a
fillOrderPart c = state $ \r -> let x = min c r in (c - x, r - x)

fillOrder1V :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1V = runState . Data.Vector.Generic.mapM fillOrderPart
{-
benchmarking static/vector-unboxed/fillOrder1
time                 53.16 μs   (53.02 μs .. 53.31 μs)
benchmarking static/vector-storable/fillOrder1
time                 52.89 μs   (52.59 μs .. 53.22 μs)
-}

fillOrder1VM :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VM book order = runST $ flip runStateT order $ do
  book' <- VG.thaw book
  VGM.mapM_ fillOrderPart book'
  VG.unsafeFreeze book'
{-
benchmarking static/vector-unboxed-copy-mut/fillOrder1
time                 11.20 μs   (11.19 μs .. 11.21 μs)
benchmarking static/vector-storable-copy-mut/fillOrder1
time                 11.51 μs   (11.46 μs .. 11.55 μs)
-}

fillOrder1VB2 :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VB2 book order =
  -- FIXME: unstreamM performs horribly to list construction, sadly
  runState (VG.unstreamM $ VFB.mapM fillOrderPart (VG.stream book)) order
{-
benchmarking static/vector-unboxed-bundle-unstreamM/fillOrder1
time                 26.30 μs   (26.24 μs .. 26.35 μs)
benchmarking static/vector-storable-bundle-unstreamM/fillOrder1
time                 46.10 μs   (45.85 μs .. 46.26 μs)
-}

-- bug in vector; defined but not exported. we re-defined it for our use here
unstreamPrimM :: (PrimMonad m, VG.Vector v a) => VFB.MBundle m u a -> m (v a)
{-# INLINE [1] unstreamPrimM #-}
unstreamPrimM s = VGM.munstream s >>= VG.unsafeFreeze

fillOrder1VB :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VB book order =
  runST $ flip runStateT order $ unstreamPrimM $ VFB.mapM fillOrderPart $ VG.stream book
{-
benchmarking static/vector-unboxed-bundle/fillOrder1
time                 1.426 μs   (1.421 μs .. 1.432 μs)
benchmarking static/vector-storable-bundle/fillOrder1
time                 855.9 ns   (852.2 ns .. 859.3 ns)
-}

By contrast, here is the equivalent rust code:

fn fill_order(book: &mut Vec<u64>, order: u64) -> u64 {
  let mut r = order;
  for c in book.iter_mut() {
    let x = r.min(*c);
    r -= x;
    *c -= x;
  }
  r
}
// fill_order              time:   [868.87 ns 870.21 ns 871.76 ns]

As you can see, the properly-written fusion version performs as well as rust. This was a very pleasant surprise for me. HOWEVER - the correct way is not documented anywhere!!! In particular, the convenience function unstreamPrimM is for some reason defined in the source code of Vector.Generic, used no-where else, not exported nor advertised, yet is absolutely vital for reaching this nirvana of performance.

By contrast, unstreamM is what's exposed in the API and destroys the performance so it performs even worse than the manual imperative mutable version. Even worse, all the utility functions are written in terms of unstreamM, e.g. VG.mapM, etc etc. Yes this means they have a convenient Monad m => constraint, but any non-haskell-expert that cares about performance would benchmark it and write off the library as "Haskell is slow". Providing mirror utilities that have a PrimMonad m => constraint that use unstreamPrimM instead of unstreamM, as well as a few examples, would help this effort.

For reference, the above took me about half a day. Not everybody exploring Haskell has that sort of time or patience.

infinity0 avatar Sep 11 '21 23:09 infinity0