vector unstreamM etc destroys performance; expose unstreamPrimM & document most-performant method

I was messing around with squeezing performance out of this library.

-- utility function used in the rest of the code
fillOrderPart :: (Monad m, Ord a, Num a) => a -> StateT a m a
fillOrderPart c = state $ \r -> let x = min c r in (c - x, r - x)

fillOrder1V :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1V = runState . Data.Vector.Generic.mapM fillOrderPart
{-
benchmarking static/vector-unboxed/fillOrder1
time                 53.16 μs   (53.02 μs .. 53.31 μs)
benchmarking static/vector-storable/fillOrder1
time                 52.89 μs   (52.59 μs .. 53.22 μs)
-}

fillOrder1VM :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VM book order = runST $ flip runStateT order $ do
  book' <- VG.thaw book
  VGM.mapM_ fillOrderPart book'
  VG.unsafeFreeze book'
{-
benchmarking static/vector-unboxed-copy-mut/fillOrder1
time                 11.20 μs   (11.19 μs .. 11.21 μs)
benchmarking static/vector-storable-copy-mut/fillOrder1
time                 11.51 μs   (11.46 μs .. 11.55 μs)
-}

fillOrder1VB2 :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VB2 book order =
  -- FIXME: unstreamM performs horribly to list construction, sadly
  runState (VG.unstreamM $ VFB.mapM fillOrderPart (VG.stream book)) order
{-
benchmarking static/vector-unboxed-bundle-unstreamM/fillOrder1
time                 26.30 μs   (26.24 μs .. 26.35 μs)
benchmarking static/vector-storable-bundle-unstreamM/fillOrder1
time                 46.10 μs   (45.85 μs .. 46.26 μs)
-}

-- bug in vector; defined but not exported. we re-defined it for our use here
unstreamPrimM :: (PrimMonad m, VG.Vector v a) => VFB.MBundle m u a -> m (v a)
{-# INLINE [1] unstreamPrimM #-}
unstreamPrimM s = VGM.munstream s >>= VG.unsafeFreeze

fillOrder1VB :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VB book order =
  runST $ flip runStateT order $ unstreamPrimM $ VFB.mapM fillOrderPart $ VG.stream book
{-
benchmarking static/vector-unboxed-bundle/fillOrder1
time                 1.426 μs   (1.421 μs .. 1.432 μs)
benchmarking static/vector-storable-bundle/fillOrder1
time                 855.9 ns   (852.2 ns .. 859.3 ns)
-}

By contrast, here is the equivalent rust code:

fn fill_order(book: &mut Vec<u64>, order: u64) -> u64 {
  let mut r = order;
  for c in book.iter_mut() {
    let x = r.min(*c);
    r -= x;
    *c -= x;
  }
  r
}
// fill_order              time:   [868.87 ns 870.21 ns 871.76 ns]

As you can see, the properly-written fusion version performs as well as rust. This was a very pleasant surprise for me. HOWEVER - the correct way is not documented anywhere!!! In particular, the convenience function unstreamPrimM is for some reason defined in the source code of Vector.Generic, used no-where else, not exported nor advertised, yet is absolutely vital for reaching this nirvana of performance.

By contrast, unstreamM is what's exposed in the API and destroys the performance so it performs even worse than the manual imperative mutable version. Even worse, all the utility functions are written in terms of unstreamM, e.g. VG.mapM, etc etc. Yes this means they have a convenient Monad m => constraint, but any non-haskell-expert that cares about performance would benchmark it and write off the library as "Haskell is slow". Providing mirror utilities that have a PrimMonad m => constraint that use unstreamPrimM instead of unstreamM, as well as a few examples, would help this effort.

For reference, the above took me about half a day. Not everybody exploring Haskell has that sort of time or patience.

Sep 11 '21 23:09 infinity0

We must face the fact that unstreamPrimM is an utterly unsafe function. Once non-deterministic monad sits in the place of m, immutable buffers can be easily rewritten and one of Haskell's nasal demons pops out everywhere, and I'm not sure it's the only demon living in this function. If we want to expose utilies with constraints PrimMonad, we would need to prefix them with unsafePrim or something and document them with expected undefined behaviors. That way, the situation will not get any easier to non-experts, I suppose. I guess that's the reason why only unstreamPrimM_IO and unstreamPrimM_ST are exported through rewrite rules (We anyway cannot expose unstreamPrimM through rules, though).

Once we are sure that my example is the only undefined behavior of the function, we could expose PrimMonad-related utilities with more strict constraints if we could write such constraint. but what's the constraint we want? With what method in a type class can we prohibit non-determinism?

Sep 12 '21 01:09 gksato

Wait -- what if we employed freeze instead of unsafeFreeze in the definition of unstreamPrimM ? It seems to contain no undefined behavior, but doesn't it really have any? And does it work in the way we expect?

Sep 12 '21 01:09 gksato

I see your point about safety. It looks like there's some understanding of this elsewhere e.g. https://www.tweag.io/blog/2021-02-10-linear-base/ "the [linear approach] even makes unsafeFreeze safe".

I didn't consider non-determinism before, but the behaviour would be safe for a linear-in-spirit monad (like State and all the other "common" monads) together with the condition that the caller should not themselves reuse the input of unstreamPrimM, which would be the case e.g. in a redefined mapM like mapPrimM f = unstreamPrimM . Bundle.mapM f . stream that performs the streaming and unstreaming internally already.

we could expose PrimMonad-related utilities with more strict constraints if we could write such constraint. but what's the constraint we want?

Simply PrimMonad m => instead of Monad m => works, if unstreamM is changed to unstreamPrimM - or I suppose waiting for linear-base's linear Monad to be production-ready would be a cleaner & more long-term solution.

Changing unsafeFreeze to freeze (which is just clone >=> unsafeFreeze) gives me the following benchmark numbers:

freeze
benchmarking static/vector-unboxed/fillOrder1VF
time                 1.220 μs   (1.214 μs .. 1.227 μs)
benchmarking static/vector-storable/fillOrder1VF
time                 1.362 μs   (1.358 μs .. 1.365 μs)

unsafeFreeze
benchmarking static/vector-unboxed/fillOrder1VF
time                 1.510 μs   (1.503 μs .. 1.516 μs)
benchmarking static/vector-storable/fillOrder1VF
time                 1.191 μs   (1.189 μs .. 1.192 μs)

(I refactored the code slightly from what I had in the OP; latest version available in https://github.com/infinity0/lang-benchmarks)

It looks like performance is not significantly affected at least in this simple case (perhaps GHC optimises away the clone?), though it's incredibly weird that unsafeFreeze/unboxed is slower than freeze/unboxed (I ran this several times and got the same weird result).

Sep 12 '21 11:09 infinity0

Simply PrimMonad m => instead of Monad m => works, if unstreamM is changed to unstreamPrimM -

ListT IO a and ListT (ST s) a (see ListT done right) or a monadic parser transformer applied to IO is a "common" non-deterministic PrimMonad. I wouldn't want to see non-experts puzzled by mysteriously transformed immutable vectors just because they used mapPrimM against ParsecT e s (ST t). I don't know what maintainers think about this, though.

or I suppose waiting for linear-base's linear Monad to be production-ready would be a cleaner & more long-term solution.

That would be a sensible solution if it works. I haven't looked deep into GHC's linear monad yet, but it sounds promising.

It looks like performance is not significantly affected at least in this simple case (perhaps GHC optimises away the clone?), though it's incredibly weird that unsafeFreeze/unboxed is slower than freeze/unboxed (I ran this several times and got the same weird result).

See #409 for another example of behavior you observed. There is also some discussion about that there.

Sep 12 '21 13:09 gksato

unstreamPrimM is more performant and is unsafe. Unsafety alone should not stop us. We have full complement of unsafe functions in the name of performance.

What's crucial is to understand how exactly they're unsafe and when it's safe to use them. unstreamPrimM is fast because it uses single mutable buffer in order to create vector. This of course causes problem in non-determinism monads. Here I'm using list-t package:

import qualified Data.Vector.Unboxed         as U
import qualified Data.Vector.Fusion.Bundle.Monadic as BM
import ListT

bundleList :: BM.Bundle (ListT IO) v Int
bundleList = BM.generateM 2 $ \i -> fromFoldable [i, 100+i]

goBad :: IO [U.Vector Int]
goBad = toList $ unstreamPrimM bundleList

>>> goBad
[[100,101],[100,101],[100,101],[100,101]]

Since all vector share same mutable buffer they all end up same! This causes problem since we continue to modify mutable vector after we called unsafeFreeze. @gksato proposed to use freeze and copy buffer. This does help for proper ListT. It doesn't work in case of ListT from old transformers. This is because we still have only single buffer and writes and freeze interleaved differently.

I suspect that this happens because ListT IO from transformers is not proper monad, it violate monadic laws. So variant of unstreamPrimM which uses freeze could be safe for lawful monads. I however can't prove this.

Sep 27 '21 16:09 Shimuuar

I've been thinking of this safety issues these days. Of course unstreamPrimM with freeze is safe (in the sense can be marked TrustWorthy), since it does nothing unsound. What we have to make sure is that unstreamPrimM and unstreamM are observably equivalent. For that, I think we need lots of laws, including monad laws:

First of all, monad laws on the PrimMonad m.
The following PrimMonad laws on m, which is undocumented. Note that they resemble MonadIO laws.
1. primitive (\s -> (# s, x #)) = return x, or equivalently stToPrim (return x) = return x
2. primitive (\s0 -> let (# s1, x #) = m# s0 in f# x s1) = primitive m# >>= primitive . f#, or equivalently stToPrim (m >>= f) = stToPrim m >>= stToPrim . f
Undocumented MVector laws on the pair of v and a with MVector v a, which includes but is absolutely not limited to the following. I guess it's appropriate to assume the ST s case, I don't know whether we can prove the general PrimMonad case from that base case (ST s case).
1. (,) <$> read v i <*> read w j = flip (,) <$> read w j <*> read v i
2. write v i a *> read v j = read v j <* write v i a given i /= j.
3. write v i a *> read v i = a <$ write v i a
4. write v i a *> write v j b = write v j b *> write v i a given i /= j.
5. write v i a *> write v i b = write v i b
6. (,) <$> write v i a <*> m = flip (,) <$> m <*> write v i a where m is a "safe" operation that does not refer to a vector that overlaps with v.

Nov 18 '21 14:11 gksato

And thanks to the explanation of @Shimuuar, I guess I got your point better, @infinity0. Obviously It's better if we have unstreamPrimM exposed. Hypothetical beginners are not the only users of this library.

Nov 18 '21 21:11 gksato

vector vector copied to clipboard

unstreamM etc destroys performance; expose unstreamPrimM & document most-performant method

vector
vector copied to clipboard