vector
vector copied to clipboard
unstreamM etc destroys performance; expose unstreamPrimM & document most-performant method
I was messing around with squeezing performance out of this library.
-- utility function used in the rest of the code
fillOrderPart :: (Monad m, Ord a, Num a) => a -> StateT a m a
fillOrderPart c = state $ \r -> let x = min c r in (c - x, r - x)
fillOrder1V :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1V = runState . Data.Vector.Generic.mapM fillOrderPart
{-
benchmarking static/vector-unboxed/fillOrder1
time 53.16 μs (53.02 μs .. 53.31 μs)
benchmarking static/vector-storable/fillOrder1
time 52.89 μs (52.59 μs .. 53.22 μs)
-}
fillOrder1VM :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VM book order = runST $ flip runStateT order $ do
book' <- VG.thaw book
VGM.mapM_ fillOrderPart book'
VG.unsafeFreeze book'
{-
benchmarking static/vector-unboxed-copy-mut/fillOrder1
time 11.20 μs (11.19 μs .. 11.21 μs)
benchmarking static/vector-storable-copy-mut/fillOrder1
time 11.51 μs (11.46 μs .. 11.55 μs)
-}
fillOrder1VB2 :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VB2 book order =
-- FIXME: unstreamM performs horribly to list construction, sadly
runState (VG.unstreamM $ VFB.mapM fillOrderPart (VG.stream book)) order
{-
benchmarking static/vector-unboxed-bundle-unstreamM/fillOrder1
time 26.30 μs (26.24 μs .. 26.35 μs)
benchmarking static/vector-storable-bundle-unstreamM/fillOrder1
time 46.10 μs (45.85 μs .. 46.26 μs)
-}
-- bug in vector; defined but not exported. we re-defined it for our use here
unstreamPrimM :: (PrimMonad m, VG.Vector v a) => VFB.MBundle m u a -> m (v a)
{-# INLINE [1] unstreamPrimM #-}
unstreamPrimM s = VGM.munstream s >>= VG.unsafeFreeze
fillOrder1VB :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VB book order =
runST $ flip runStateT order $ unstreamPrimM $ VFB.mapM fillOrderPart $ VG.stream book
{-
benchmarking static/vector-unboxed-bundle/fillOrder1
time 1.426 μs (1.421 μs .. 1.432 μs)
benchmarking static/vector-storable-bundle/fillOrder1
time 855.9 ns (852.2 ns .. 859.3 ns)
-}
By contrast, here is the equivalent rust code:
fn fill_order(book: &mut Vec<u64>, order: u64) -> u64 {
let mut r = order;
for c in book.iter_mut() {
let x = r.min(*c);
r -= x;
*c -= x;
}
r
}
// fill_order time: [868.87 ns 870.21 ns 871.76 ns]
As you can see, the properly-written fusion version performs as well as rust. This was a very pleasant surprise for me. HOWEVER - the correct way is not documented anywhere!!! In particular, the convenience function unstreamPrimM
is for some reason defined in the source code of Vector.Generic
, used no-where else, not exported nor advertised, yet is absolutely vital for reaching this nirvana of performance.
By contrast, unstreamM is what's exposed in the API and destroys the performance so it performs even worse than the manual imperative mutable version. Even worse, all the utility functions are written in terms of unstreamM, e.g. VG.mapM, etc etc. Yes this means they have a convenient Monad m =>
constraint, but any non-haskell-expert that cares about performance would benchmark it and write off the library as "Haskell is slow". Providing mirror utilities that have a PrimMonad m =>
constraint that use unstreamPrimM
instead of unstreamM
, as well as a few examples, would help this effort.
For reference, the above took me about half a day. Not everybody exploring Haskell has that sort of time or patience.
We must face the fact that unstreamPrimM
is an utterly unsafe function. Once non-deterministic monad sits in the place of m
, immutable buffers can be easily rewritten and one of Haskell's nasal demons pops out everywhere, and I'm not sure it's the only demon living in this function. If we want to expose utilies with constraints PrimMonad
, we would need to prefix them with unsafePrim
or something and document them with expected undefined behaviors. That way, the situation will not get any easier to non-experts, I suppose. I guess that's the reason why only unstreamPrimM_IO
and unstreamPrimM_ST
are exported through rewrite rules (We anyway cannot expose unstreamPrimM
through rules, though).
Once we are sure that my example is the only undefined behavior of the function, we could expose PrimMonad
-related utilities with more strict constraints if we could write such constraint. but what's the constraint we want? With what method in a type class can we prohibit non-determinism?
Wait -- what if we employed freeze
instead of unsafeFreeze
in the definition of unstreamPrimM
? It seems to contain no undefined behavior, but doesn't it really have any? And does it work in the way we expect?
I see your point about safety. It looks like there's some understanding of this elsewhere e.g. https://www.tweag.io/blog/2021-02-10-linear-base/ "the [linear approach] even makes unsafeFreeze
safe".
I didn't consider non-determinism before, but the behaviour would be safe for a linear-in-spirit monad (like State
and all the other "common" monads) together with the condition that the caller should not themselves reuse the input of unstreamPrimM
, which would be the case e.g. in a redefined mapM like mapPrimM f = unstreamPrimM . Bundle.mapM f . stream
that performs the streaming and unstreaming internally already.
we could expose PrimMonad-related utilities with more strict constraints if we could write such constraint. but what's the constraint we want?
Simply PrimMonad m =>
instead of Monad m =>
works, if unstreamM
is changed to unstreamPrimM
- or I suppose waiting for linear-base's linear Monad
to be production-ready would be a cleaner & more long-term solution.
Changing unsafeFreeze
to freeze
(which is just clone >=> unsafeFreeze
) gives me the following benchmark numbers:
freeze
benchmarking static/vector-unboxed/fillOrder1VF
time 1.220 μs (1.214 μs .. 1.227 μs)
benchmarking static/vector-storable/fillOrder1VF
time 1.362 μs (1.358 μs .. 1.365 μs)
unsafeFreeze
benchmarking static/vector-unboxed/fillOrder1VF
time 1.510 μs (1.503 μs .. 1.516 μs)
benchmarking static/vector-storable/fillOrder1VF
time 1.191 μs (1.189 μs .. 1.192 μs)
(I refactored the code slightly from what I had in the OP; latest version available in https://github.com/infinity0/lang-benchmarks)
It looks like performance is not significantly affected at least in this simple case (perhaps GHC optimises away the clone?), though it's incredibly weird that unsafeFreeze/unboxed is slower than freeze/unboxed (I ran this several times and got the same weird result).
Simply
PrimMonad m =>
instead ofMonad m =>
works, ifunstreamM
is changed tounstreamPrimM
-
ListT IO a
and ListT (ST s) a
(see ListT done right) or a monadic parser transformer applied to IO
is a "common" non-deterministic PrimMonad
. I wouldn't want to see non-experts puzzled by mysteriously transformed immutable vectors just because they used mapPrimM
against ParsecT e s (ST t)
. I don't know what maintainers think about this, though.
or I suppose waiting for linear-base's linear
Monad
to be production-ready would be a cleaner & more long-term solution.
That would be a sensible solution if it works. I haven't looked deep into GHC's linear monad yet, but it sounds promising.
It looks like performance is not significantly affected at least in this simple case (perhaps GHC optimises away the clone?), though it's incredibly weird that unsafeFreeze/unboxed is slower than freeze/unboxed (I ran this several times and got the same weird result).
See #409 for another example of behavior you observed. There is also some discussion about that there.
unstreamPrimM
is more performant and is unsafe. Unsafety alone should not stop us. We have full complement of unsafe functions in the name of performance.
What's crucial is to understand how exactly they're unsafe and when it's safe to use them. unstreamPrimM
is fast because it uses single mutable buffer in order to create vector. This of course causes problem in non-determinism monads. Here I'm using list-t
package:
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Fusion.Bundle.Monadic as BM
import ListT
bundleList :: BM.Bundle (ListT IO) v Int
bundleList = BM.generateM 2 $ \i -> fromFoldable [i, 100+i]
goBad :: IO [U.Vector Int]
goBad = toList $ unstreamPrimM bundleList
>>> goBad
[[100,101],[100,101],[100,101],[100,101]]
Since all vector share same mutable buffer they all end up same! This causes problem since we continue to modify mutable vector after we called unsafeFreeze
. @gksato proposed to use freeze
and copy buffer. This does help for proper ListT
. It doesn't work in case of ListT
from old transformers. This is because we still have only single buffer and writes and freeze interleaved differently.
I suspect that this happens because ListT IO
from transformers is not proper monad, it violate monadic laws. So variant of unstreamPrimM
which uses freeze
could be safe for lawful monads. I however can't prove this.
I've been thinking of this safety issues these days. Of course unstreamPrimM
with freeze
is safe (in the sense can be marked TrustWorthy), since it does nothing unsound. What we have to make sure is that unstreamPrimM
and unstreamM
are observably equivalent. For that, I think we need lots of laws, including
monad laws:
- First of all, monad laws on the
PrimMonad m
. - The following
PrimMonad
laws onm
, which is undocumented. Note that they resembleMonadIO
laws.-
primitive (\s -> (# s, x #)) = return x
, or equivalentlystToPrim (return x) = return x
-
primitive (\s0 -> let (# s1, x #) = m# s0 in f# x s1) = primitive m# >>= primitive . f#
, or equivalentlystToPrim (m >>= f) = stToPrim m >>= stToPrim . f
-
-
Undocumented
MVector
laws on the pair ofv
anda
withMVector v a
, which includes but is absolutely not limited to the following. I guess it's appropriate to assume theST s
case, I don't know whether we can prove the generalPrimMonad
case from that base case (ST s
case).-
(,) <$> read v i <*> read w j = flip (,) <$> read w j <*> read v i
-
write v i a *> read v j = read v j <* write v i a
giveni /= j
. -
write v i a *> read v i = a <$ write v i a
-
write v i a *> write v j b = write v j b *> write v i a
giveni /= j
. -
write v i a *> write v i b = write v i b
-
(,) <$> write v i a <*> m = flip (,) <$> m <*> write v i a
wherem
is a "safe" operation that does not refer to a vector thatoverlap
s withv
.
-
And thanks to the explanation of @Shimuuar, I guess I got your point better, @infinity0. Obviously It's better if we have unstreamPrimM
exposed. Hypothetical beginners are not the only users of this library.