vector
vector copied to clipboard
Consider design change in light of GHC join point work
The join points paper suggests that it may now be possible to avoid the need for Skip
altogether. It would be worth checking if that's actually the case. If so, it might be possible to simplify the implementation considerably. Furthermore, I seem to recall that the presence of skips (which allow vectors of the same length to be represented by streams of different lengths) prevents one or more optimizations. I vaguely remember trying to do something a bit clever with replicate
and finding it impossible because of skips, but I don't remember any details.
... filter can't fuse without skips can it?
On Tue, Feb 21, 2017 at 2:49 PM David Feuer [email protected] wrote:
The join points paper https://www.microsoft.com/en-us/research/publication/compiling-without-continuations/ suggests that it may now be possible to avoid the need for Skip altogether. It would be worth checking if that's actually the case. If so, it might be possible to simplify the implementation considerably. Furthermore, I seem to recall that the presence of skips (which allow vectors of the same length to be represented by streams of different lengths) prevents one or more optimizations. I vaguely remember trying to do something a bit clever with replicate and finding it impossible because of skips, but I don't remember any details.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAQwtSu6MWDlo1qGWdrgLswl4Muet-aks5rez-1gaJpZM4MHx21 .
I believe the join point paper claims the new approach would allow it to do so, yes. Whether that is actually the case as implemented, I don't know.
On Feb 23, 2017 6:02 AM, "Carter Tazio Schonwald" [email protected] wrote:
... filter can't fuse without skips can it?
On Tue, Feb 21, 2017 at 2:49 PM David Feuer [email protected] wrote:
The join points paper <https://www.microsoft.com/en-us/research/publication/compiling-without- continuations/> suggests that it may now be possible to avoid the need for Skip altogether. It would be worth checking if that's actually the case. If so, it might be possible to simplify the implementation considerably. Furthermore, I seem to recall that the presence of skips (which allow vectors of the same length to be represented by streams of different lengths) prevents one or more optimizations. I vaguely remember trying to do something a bit clever with replicate and finding it impossible because of skips, but I don't remember any details.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156, or mute the thread <https://github.com/notifications/unsubscribe-auth/ AAAQwtSu6MWDlo1qGWdrgLswl4Muet-aks5rez-1gaJpZM4MHx21> .
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-281962713, or mute the thread https://github.com/notifications/unsubscribe-auth/ABzi_d0j-U5DDGe2FmQYYxi0Dk9RF-_Vks5rfWdHgaJpZM4MHx21 .
The paper says the new join point handling makes it so that recursive stepper functions are no longer a problem. So Skip
is no longer needed.
However, making this change would break (fusion-wise) the functions that require Skip
on every compiler version that doesn't have the new join point handling. I assume that's everything before 8.2, or at best 8.0. So I'm not sure such a change can be justified for quite some time.
And we still need to measure impact on compile time/ code side / performance for vector.
That would change how loop breakers etc are chosen sometimes right? On Fri, Feb 24, 2017 at 12:50 AM dolio [email protected] wrote:
The paper says the new join point handling makes it so that recursive stepper functions are no longer a problem. So Skip is no longer needed.
However, making this change would break (fusion-wise) the functions that require Skip on every compiler version that doesn't have the new join point handling. I assume that's everything before 8.2, or at best 8.0. So I'm not sure such a change can be justified for quite some time.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-282210427, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAQwr2jc6OW3lYepHs9qySRvlRAcjpGks5rfm-lgaJpZM4MHx21 .
The one way I can think of to accelerate this kind of thing is to decouple the fusion from vectors. Have modules that provide plain vectors, and other modules that somehow add in fusion. Then give a choice of which fusion framework to use.
However, I'm not very sure how to accomplish something like this. The most obvious thing to say is, "this is where you want ML modules," because fusion is sort of a thing that transforms one implementation of an API into another. But backpack is probably more of a limiting factor than the join point stuff. Perhaps a newtype could work:
newtype Fused v a = F (v a)
Where you use Fused Vector a
if you want fusion to happen for your choice of Vector
.
Anyhow, this is a research project on top of a research project. :)
@treeowl whats the best way to explore the join point cleanup in your mind?
Any thoughts on this recently? From the paper it sound very interesting:
Result: simpler code, less of it, and faster to execute. It’s a straight win
Would it be worthwhile to do a small prototype to see e.g. how much it improves runtime and/or compiler performance? Perhaps some of the zip or zipWith functions ? I picked those based on the following from the paper, but I'm not sure if it applies to zip for vectors:
Now the stepper function can say to update the state and call again, obviating the need for a loop of its own. This makes filter fusible, but it complicates everything else! Everything gets three cases instead of two, leading to more code and more runtime tests; and functions like zip that consume two lists become more complicated and less efficient.
It is interesting, but it is also an enormous amount of work. @GeorgeCo You feel like working on this and submitting a PR with a prototype? If I were to work on this I would start with implementing and verifying performance of filter
, because that is exactly what we need Skip
for.
We already have a plan for putting all the streaming functionality that uses Stream
and Step
types into a separate package #355 So if this approach with relying on join points proves itself as feasible replacement for newer ghc we could then have two stream packages: one for older ghc with Skip
and another package for newer ghc versions that doesn't have to rely on Skip
for performance.
I've just done a quick scan over the code. Now I'm stuck and it seems to me like it is impossible to remove Skip
for monadic streams. Take for example the drop
function. I've removed the Step
constructor and replaced its usage by a recursive call:
drop :: Monad m => Int -> Stream m a -> Stream m a
{-# INLINE_FUSED drop #-}
drop n (Stream step t) = Stream step' (t, Just n)
where
{-# INLINE_INNER step' #-}
step' (s, Just i) | i > 0 = do
r <- step s
case r of
Yield _ s' -> step' (s', Just (i-1))
Done -> return Done
step' (s, _) = liftM (\r ->
case r of
Yield x s' -> Yield x (s', Nothing)
Done -> Done
) (step s)
But here the step' (s', Just (i - 1))
recursive call is not guaranteed to be a tail call (that depends on what the Monad m
is), so GHC can't mark step'
as a join point in general (if I understand it correctly).
Or is there another way to write this so that it can use a join point?
This is a very cool observation, would having a flavor of monads that guarantee they respect tail calls, in a way ghc understands, solve that?
I’ve hit a similar technical issue in some of my own prjects over time, though with different motivations. I’ll ask around
On Mon, Sep 26, 2022 at 11:07 AM Jaro Reinders @.***> wrote:
I've just done a quick scan over the code. It seems to me like it is impossible to remove Skip for monadic streams. Take for example the drop function. I've removed the Step constructor and replaced its usage by a recursive call:
drop :: Monad m => Int -> Stream m a -> Stream m a {-# INLINE_FUSED drop #-}drop n (Stream step t) = Stream step' (t, Just n) where {-# INLINE_INNER step' #-} step' (s, Just i) | i > 0 = do r <- step s case r of Yield _ s' -> step' (s', Just (i-1)) Done -> return Done
step' (s, _) = liftM (\r -> case r of Yield x s' -> Yield x (s', Nothing) Done -> Done ) (step s)
But here the step' (s', Just (i - 1)) recursive call is not guaranteed to be a tail call (that depends on what the Monad m is), so GHC can't mark step' as a join point in general (if I understand it correctly).
— Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-1258181207, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABBQXHQZAQDOBSHR2XCUTWAG33LANCNFSM4DA7DW2Q . You are receiving this because you commented.Message ID: @.***>
would having a flavor of monads that guarantee they respect tail calls, in a way ghc understands, solve that?
That's exactly what I was thinking!
I wonder what that looks like. I know somebody asked SPJ once after one of his talks if join points could be made explicit in the surface language and SPJ considered the possibility but didn't see an immediate use case.
I also think this is related to Alexis King's "Effects for Less" talk which mentions that it is really important that the monadic bind operator gets inlined.
Perhaps if we use her Eff
monad instead of arbitrary monads in the streams then we would not have this issue.
It might be worthwhile to draw SPJ and the other author of the paper into this conversation.
ping @simonpj @pdownen @lukemaurer (I couldn't find Zena Ariola's github account if she even has one)
Now I'm stuck and it seems to me like it is impossible to remove Skip for monadic streams. T
We are taking about fusion here, correct? That's only going to happen if you inine the function (in this case drop
at the call site, else there is nothing to fuse with. And once you inline, you likely have fixed the monad.
I rather doubt that you'll achieve fusion for an arbitrary monad.
A good way to move this converation on would be to provide a concrete example of
- a program where you think fusion should take place
- the fused program that you think GHC should be able to reach by program transformations.
Being concrete aids comprehension, and saves misinderstandings.
Oh, that sounds right. Now I feel silly for missing that.
To be much more concrete. Take for example the Quickhull benchmark. Before my changes it takes 55.7 ms and after removing Skip it takes 72.2 ms. Looking at the core reveals a bunch of bindings like the following which I think are to blame for this discrepancy. These are not present in the core dumps from before my changes.
letrec {
$s$wstep'_sebJ :: Int# -> Int# ->
Id (Step (Int, Int, Maybe (Double, Double)) ((Double, Double), Double))
$s$wstep'_sebJ
= \ (sc_sebF :: Int#) (sc1_sebE :: Int#) ->
case >=# sc1_sebE ww_sdOU of {
__DEFAULT ->
case indexDoubleArray# ww2_sdP0 (+# ww1_sdOY sc1_sebE) of wild2_a6SW { __DEFAULT ->
case indexDoubleArray# ww4_sdP7 (+# ww3_sdP5 sc1_sebE) of wild1_Xe { __DEFAULT ->
case >=# sc_sebF ww7_sdOz of {
__DEFAULT ->
case indexDoubleArray# ipv3_adv1 sc_sebF of wild3_Xh { __DEFAULT ->
case >## wild3_Xh 0.0## of {
__DEFAULT -> $s$wstep'_sebJ (+# sc_sebF 1#) (+# sc1_sebE 1#); -- <<< recursive call here
1# ->
(Yield
((D# wild2_a6SW, D# wild1_Xe), D# wild3_Xh)
(I# (+# sc1_sebE 1#), I# (+# sc_sebF 1#), Nothing))
`cast` <Co:15>
}};
1# -> Done `cast` <Co:15>
}}};
1# -> Done `cast` <Co:15>
}; } in
Here are the full pre and post dumps of all the benchmarks: bench-dumps.zip
And the particular benchmarks that got noticeably worse are: listrank, awshcc, hybcc, quickhull, findIndexR, findIndexR_manual. Especially awshcc got more than 3x slower and hybcc about 2x slower.
Maybe it is some kind of cross-function case of case optimization that is lacking. After reducing an example to just a V.map (* 2) . V.filter (> 10)
I get this core:
letrec {
$wstep'_s3Uf :: Int# -> Id (Step Int Double)
$wstep'_s3Uf
= \ (ww2_s3Ud :: Int#) ->
case >=# ww2_s3Ud ipv1_s3Po of {
__DEFAULT ->
case indexDoubleArray# ipv2_s3Pp (+# ipv_s3Pn ww2_s3Ud)
of wild2_i3d4
{ __DEFAULT ->
case >## wild2_i3d4 10.0## of {
__DEFAULT -> $wstep'_s3Uf (+# ww2_s3Ud 1#);
1# ->
(Yield (D# wild2_i3d4) (I# (+# ww2_s3Ud 1#))) `cast` <Co:5>
}
};
1# -> Done `cast` <Co:5>
}; } in
joinrec {
$s$wfoldlM'_loop_s3WN
:: State# RealWorld -> Int# -> Int# -> Vector Double
$s$wfoldlM'_loop_s3WN (sc_s3WM :: State# RealWorld)
(sc1_s3WK :: Int#)
(sc2_s3WJ :: Int#)
= case ($wstep'_s3Uf sc1_s3WK) `cast` <Co:4> of { -- <<< $wstep' can't be a join point
Yield x_a3P4 s'_a3P5 ->
case x_a3P4 of { D# x1_a2wq ->
case writeDoubleArray#
ipv4_i3En
sc2_s3WJ
(+## x1_a2wq x1_a2wq)
(sc_s3WM `cast` <Co:5>)
of s'#_i3S3
{ __DEFAULT ->
case s'_a3P5 of { I# ww3_X4 ->
jump $s$wfoldlM'_loop_s3WN
(s'#_i3S3 `cast` <Co:4>) ww3_X4 (+# sc2_s3WJ 1#)
}
}
};
Done ->
case unsafeFreezeByteArray# ipv4_i3En (sc_s3WM `cast` <Co:5>) of
{ (# ipv5_i3ci, ipv6_i3cj #) ->
(Vector 0# sc2_s3WJ ipv6_i3cj) `cast` <Co:5>
}
}; } in
jump $s$wfoldlM'_loop_s3WN (ipv3_i3Em `cast` <Co:4>) 0# 0#
Maybe I'm running into https://gitlab.haskell.org/ghc/ghc/-/issues/16335 again?
Maybe this is too hard, but could you distil out a standalone repro case?
Or (less good but still OK) supply precise repro instructions for your "reduced example". By precise I mean "check out this repo, cd to there, execute these commands..."
Looking at the code you show, I think that FloatIn should move the binding for $wstep'
inwards, into the scrutinee of the case; there it will become a join point and good things should happen. I don't know why this doesn't happen.
Here's a standalone reproducer:
https://gist.github.com/noughtmare/f979c0fc9eed30abcd0849d19cf0f746
with the interesting core here:
https://gist.github.com/noughtmare/f979c0fc9eed30abcd0849d19cf0f746#file-all-dump-simpl-L2737-L2794
Commands you can run:
wget https://gist.githubusercontent.com/noughtmare/f979c0fc9eed30abcd0849d19cf0f746/raw/858db7da16a5ba542148d794655057bed33f88b3/All.hs
ghc -package ghc-prim -package primitive -O2 -ddump-simpl -dusppress-all -dno-suppress-type-signatures -ddump-to-file All.hs
If I translate the produced core back to Haskell:
module T where
import Control.Monad.Primitive
import Data.Primitive.Array
import System.IO.Unsafe (unsafePerformIO)
data Step s a = Yield a s | Done
uninitialised = undefined
test :: Int -> Int -> Array Double -> (Int, Int, Array Double)
test off n oldArr = unsafePerformIO $ do
newArr <- newArray n uninitialised
let
step' i
| i >= n = Done
| otherwise =
let x = indexArray oldArr (off + i) in
if x > 10
then Yield x (i + 1)
else step' (i + 1)
loop i j = do
case step' i of
Yield x s' -> do
writeArray newArr j (x + 1)
loop s' (j + 1)
Done -> do
out <- unsafeFreezeArray newArr
return (0, j, out)
loop 0 0
And then compile that again I do get the nice core with two join points where the intermediate Yield
and Done
constructors are eliminated:
$wtest :: Int -> Int# -> Array Double -> (Int, Int, Array Double)
$wtest
= \ (w :: Int) (ww :: Int#) (w1 :: Array Double) ->
runRW#
(\ (s :: State# RealWorld) ->
case noDuplicate# s of s' { __DEFAULT ->
case newArray# ww uninitialised (s' `cast` <Co:3>) of
{ (# ipv, ipv1 #) ->
joinrec {
$s$wloop
:: State# RealWorld -> Int# -> Int# -> (Int, Int, Array Double)
$s$wloop (sc :: State# RealWorld) (sc1 :: Int#) (sc2 :: Int#)
= join {
$j :: (Int, Int, Array Double)
$j
= case unsafeFreezeArray# ipv1 (sc `cast` <Co:3>) of
{ (# ipv2, ipv3 #) ->
lazy (test1, I# sc1, Array ipv3)
} } in
case >=# sc2 ww of {
__DEFAULT ->
case w of { I# x ->
case w1 of { Array ds2 ->
case indexArray# ds2 (+# x sc2) of { (# ipv2 #) ->
case ipv2 of { D# x1 ->
case >## x1 10.0## of {
__DEFAULT ->
joinrec {
$wstep' :: Int# -> (Int, Int, Array Double)
$wstep' (ww1 :: Int#)
= case >=# ww1 ww of {
__DEFAULT ->
case indexArray# ds2 (+# x ww1) of { (# ipv3 #) ->
case ipv3 of { D# x2 ->
case >## x2 10.0## of {
__DEFAULT -> jump $wstep' (+# ww1 1#);
1# ->
case writeArray#
ipv1 sc1 (D# (+## x2 1.0##)) (sc `cast` <Co:3>)
of s'#
{ __DEFAULT ->
jump $s$wloop (s'# `cast` <Co:2>) (+# sc1 1#) (+# ww1 1#)
}
} } };
1# -> jump $j
}; } in
jump $wstep' (+# sc2 1#);
1# ->
case writeArray# ipv1 sc1 (D# (+## x1 1.0##)) (sc `cast` <Co:3>)
of s'#
{ __DEFAULT ->
jump $s$wloop (s'# `cast` <Co:2>) (+# sc1 1#) (+# sc2 1#)
}
} } } } };
1# -> jump $j
}; } in
jump $s$wloop (ipv `cast` <Co:2>) 0# 0#
} })
So is the simplifier just running out of steam?
Hi Jaro
Good progress on this! wrt
So is the simplifier just running out of steam?
Did you try any of the increasing the count of any of the simplifier options in https://downloads.haskell.org/ghc/latest/docs/users_guide/using-optimisation.html ?
Unfortunately I am very far from being an expert but the one that looked most likely to be was the following and the associated -fddump-simpl-stats
-fsimpl-tick-factor=⟨n⟩ Default: 100
GHC’s optimiser can diverge if you write rewrite rules (Rewrite rules https://downloads.haskell.org/ghc/latest/docs/users_guide/exts/rewrite_rules.html#rewrite-rules) that don’t terminate, or (less satisfactorily) if you code up recursion through data types (Bugs in GHC https://downloads.haskell.org/ghc/latest/docs/users_guide/bugs.html#bugs-ghc). To avoid making the compiler fall into an infinite loop, the optimiser carries a “tick count” and stops inlining and applying rewrite rules when this count is exceeded. The limit is set as a multiple of the program size, so bigger programs get more ticks. The -fsimpl-tick-factor flag lets you change the multiplier. The default is 100; numbers larger than 100 give more ticks, and numbers smaller than 100 give fewer.
If the tick-count expires, GHC summarises what simplifier steps it has done; you can use -fddump-simpl-stats to generate a much more detailed list. Usually that identifies the loop quite accurately, because some numbers are very large.
On Tue, Sep 27, 2022 at 11:17 AM Jaro Reinders @.***> wrote:
If I translate the produced core back to Haskell:
module T where import Control.Monad.Primitiveimport Data.Primitive.Arrayimport System.IO.Unsafe (unsafePerformIO) data Step s a = Yield a s | Done
uninitialised = undefined test :: Int -> Int -> Array Double -> (Int, Int, Array Double) test off n oldArr = unsafePerformIO $ do newArr <- newArray n uninitialised let step' i | i >= n = Done | otherwise = let x = indexArray oldArr (off + i) in if x > 10 then Yield x (i + 1) else step' (i + 1) loop i sc2 = do let c = step' i case c of Yield x s' -> do writeArray newArr sc2 (x + 1) loop s' (sc2 + 1) Done -> do out <- unsafeFreezeArray newArr return (0, sc2, out) loop 0 0
And then compile that again I do get the nice core with two join points where the intermediate Yield and Done constructors are eliminated:
$wtest :: Int -> Int# -> Array Double -> (Int, Int, Array Double)$wtest = \ (w :: Int) (ww :: Int#) (w1 :: Array Double) -> runRW# (\ (s :: State# RealWorld) -> case noDuplicate# s of s' { __DEFAULT -> case newArray# ww uninitialised (s'
cast
Co:3) of { (# ipv, ipv1 #) -> joinrec { $s$wloop :: State# RealWorld -> Int# -> Int# -> (Int, Int, Array Double) $s$wloop (sc :: State# RealWorld) (sc1 :: Int#) (sc2 :: Int#) = join { $j :: (Int, Int, Array Double) $j = case unsafeFreezeArray# ipv1 (sccast
Co:3) of { (# ipv2, ipv3 #) -> lazy (test1, I# sc1, Array ipv3) } } in case >=# sc2 ww of { __DEFAULT -> case w of { I# x -> case w1 of { Array ds2 -> case indexArray# ds2 (+# x sc2) of { (# ipv2 #) -> case ipv2 of { D# x1 -> case >## x1 10.0## of { __DEFAULT -> joinrec { $wstep' :: Int# -> (Int, Int, Array Double) $wstep' (ww1 :: Int#) = case >=# ww1 ww of { __DEFAULT -> case indexArray# ds2 (+# x ww1) of { (# ipv3 #) -> case ipv3 of { D# x2 -> case >## x2 10.0## of { __DEFAULT -> jump $wstep' (+# ww1 1#); 1# -> case writeArray# ipv1 sc1 (D# (+## x2 1.0##)) (sccast
Co:3) of s'# { __DEFAULT -> jump $s$wloop (s'#cast
Co:2) (+# sc1 1#) (+# ww1 1#) } } } }; 1# -> jump $j }; } in jump $wstep' (+# sc2 1#); 1# -> case writeArray# ipv1 sc1 (D# (+## x1 1.0##)) (sccast
Co:3) of s'# { __DEFAULT -> jump $s$wloop (s'#cast
Co:2) (+# sc1 1#) (+# sc2 1#) } } } } } }; 1# -> jump $j }; } in jump $s$wloop (ipvcast
Co:2) 0# 0# } })So is the simplifier just running out of steam?
— Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-1259576807, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQIJ65QW6AXKBWUDW4Q7ODWAL6XHANCNFSM4DA7DW2Q . You are receiving this because you were mentioned.Message ID: @.***>
I did try increasing the max simplifier iterations which did not help.
But now I've rewritten it using the lower level unboxed operations:
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module T where
import GHC.Exts
import GHC.IO
data Step s a = Yield a s | Done
uninitialised = undefined
test :: Int# -> Int# -> Array# Double -> (# Int#, Int#, Array# Double #)
test off n oldArr = runRW# $ \s0 ->
case newArray# n uninitialised s0
of { (# s1, newArr #) ->
let
step' i
| isTrue# (i >=# n) = Done
| otherwise =
let (# D# x #) = indexArray# oldArr (off +# i) in
if isTrue# (x >## 10.0##)
then Yield (D# x) (I# (i +# 1#))
else step' (i +# 1#)
loop i j s2 =
case step' i of
Yield x (I# s') ->
case writeArray# newArr j (x + 1) s2
of { s3 ->
loop s' (j +# 1#) s3
}
Done ->
case unsafeFreezeArray# newArr s2
of { (# s3, out #) ->
(# 0#, j, out #)
}
in
loop 0# 0# s1
}
And compiling this program does show the optimisation failure:
test
= \ off n oldArr ->
runRW#
(\ s0 ->
case newArray# n uninitialised s0 of { (# ipv, ipv1 #) ->
letrec {
step'
= \ i ->
case >=# i n of {
__DEFAULT ->
case indexArray# oldArr (+# off i) of { (# ipv2 #) ->
case ipv2 of wild { D# x ->
case >## x 10.0## of {
__DEFAULT -> step' (+# i 1#);
1# -> Yield wild (I# (+# i 1#))
}
}
};
1# -> Done
}; } in
join {
exit j s2
= case unsafeFreezeArray# ipv1 s2 of { (# ipv2, ipv3 #) ->
(# 0#, j, ipv3 #)
} } in
joinrec {
loop i j s2
= case step' i of {
Yield x ds1 ->
case ds1 of { I# s' ->
case writeArray#
ipv1 j (case x of { D# x1 -> D# (+## x1 1.0##) }) s2
of s3
{ __DEFAULT ->
jump loop s' (+# j 1#) s3
}
};
Done -> jump exit j s2
}; } in
jump loop 0# 0# ipv
})
I think it is time to open a proper GHC issue for this, so I've done that: https://gitlab.haskell.org/ghc/ghc/-/issues/22227
this is great :)
On Tue, Sep 27, 2022 at 11:08 AM Jaro Reinders @.***> wrote:
I think it is time to open a proper GHC issue for this, so I've done that: https://gitlab.haskell.org/ghc/ghc/-/issues/22227
— Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-1259647533, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABBQVAEPOXFMAY4EJVU4DWAMEWBANCNFSM4DA7DW2Q . You are receiving this because you commented.Message ID: @.***>
Good news! I have been able to manually implement the "loopification" optimization that GHC is missing (see this WIP merge request of GHC) with https://github.com/haskell/vector/pull/448/commits/cfaf0d4b0102b294784378bc3d77e6c7ceb2407f. Now almost all benchmarks are the same or faster (tridiag speeds up the most from 86ms to 19.3ms) except awshcc which is still about 2x slower (406ms to 872ms). I'll look into that now. For now, I can already say is that it is able to fuse properly so the problem must be something else.
Wow !
Would love to see the table of timing changes
I've now reached the point that all benchmarks are (approximately) equal or faster, see https://github.com/haskell/vector/pull/448#issuecomment-1299110013.
@noughtmare That is all pretty great news.
If I understand correctly the proper speed up is only observed with the ghc patch
How does the regression look for all the older ghc versions. We do support all the way down to ghc-8.0
@lehins
If I understand correctly the proper speed up is only observed with the ghc patch
No, I've manually implemented the optimization that GHC would do with that patch.
For now I've tested with GHC 9.2.4. I'll do some more tests with other GHC versions, but I believe this only really requires join points (I don't know when those were introduced, but I think GHC 8 already had them).
Now I understand, that is great! Awesome work!
Let's make sure we don't mess up performance for anyone and I don't see a reason why we shouldn't get this change into vector.