vector Consider design change in light of GHC join point work

The join points paper suggests that it may now be possible to avoid the need for Skip altogether. It would be worth checking if that's actually the case. If so, it might be possible to simplify the implementation considerably. Furthermore, I seem to recall that the presence of skips (which allow vectors of the same length to be represented by streams of different lengths) prevents one or more optimizations. I vaguely remember trying to do something a bit clever with replicate and finding it impossible because of skips, but I don't remember any details.

Feb 21 '17 19:02 treeowl

... filter can't fuse without skips can it?

On Tue, Feb 21, 2017 at 2:49 PM David Feuer [email protected] wrote:

The join points paper https://www.microsoft.com/en-us/research/publication/compiling-without-continuations/ suggests that it may now be possible to avoid the need for Skip altogether. It would be worth checking if that's actually the case. If so, it might be possible to simplify the implementation considerably. Furthermore, I seem to recall that the presence of skips (which allow vectors of the same length to be represented by streams of different lengths) prevents one or more optimizations. I vaguely remember trying to do something a bit clever with replicate and finding it impossible because of skips, but I don't remember any details.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAQwtSu6MWDlo1qGWdrgLswl4Muet-aks5rez-1gaJpZM4MHx21 .

Feb 23 '17 11:02 cartazio

I believe the join point paper claims the new approach would allow it to do so, yes. Whether that is actually the case as implemented, I don't know.

On Feb 23, 2017 6:02 AM, "Carter Tazio Schonwald" [email protected] wrote:

... filter can't fuse without skips can it?

On Tue, Feb 21, 2017 at 2:49 PM David Feuer [email protected] wrote:

The join points paper <https://www.microsoft.com/en-us/research/publication/compiling-without- continuations/> suggests that it may now be possible to avoid the need for Skip altogether. It would be worth checking if that's actually the case. If so, it might be possible to simplify the implementation considerably. Furthermore, I seem to recall that the presence of skips (which allow vectors of the same length to be represented by streams of different lengths) prevents one or more optimizations. I vaguely remember trying to do something a bit clever with replicate and finding it impossible because of skips, but I don't remember any details.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156, or mute the thread <https://github.com/notifications/unsubscribe-auth/ AAAQwtSu6MWDlo1qGWdrgLswl4Muet-aks5rez-1gaJpZM4MHx21> .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-281962713, or mute the thread https://github.com/notifications/unsubscribe-auth/ABzi_d0j-U5DDGe2FmQYYxi0Dk9RF-_Vks5rfWdHgaJpZM4MHx21 .

Feb 23 '17 11:02 treeowl

The paper says the new join point handling makes it so that recursive stepper functions are no longer a problem. So Skip is no longer needed.

However, making this change would break (fusion-wise) the functions that require Skip on every compiler version that doesn't have the new join point handling. I assume that's everything before 8.2, or at best 8.0. So I'm not sure such a change can be justified for quite some time.

Feb 24 '17 05:02 dolio

And we still need to measure impact on compile time/ code side / performance for vector.

That would change how loop breakers etc are chosen sometimes right? On Fri, Feb 24, 2017 at 12:50 AM dolio [email protected] wrote:

The paper says the new join point handling makes it so that recursive stepper functions are no longer a problem. So Skip is no longer needed.

However, making this change would break (fusion-wise) the functions that require Skip on every compiler version that doesn't have the new join point handling. I assume that's everything before 8.2, or at best 8.0. So I'm not sure such a change can be justified for quite some time.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-282210427, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAQwr2jc6OW3lYepHs9qySRvlRAcjpGks5rfm-lgaJpZM4MHx21 .

Feb 24 '17 17:02 cartazio

The one way I can think of to accelerate this kind of thing is to decouple the fusion from vectors. Have modules that provide plain vectors, and other modules that somehow add in fusion. Then give a choice of which fusion framework to use.

However, I'm not very sure how to accomplish something like this. The most obvious thing to say is, "this is where you want ML modules," because fusion is sort of a thing that transforms one implementation of an API into another. But backpack is probably more of a limiting factor than the join point stuff. Perhaps a newtype could work:

newtype Fused v a = F (v a)

Where you use Fused Vector a if you want fusion to happen for your choice of Vector.

Anyhow, this is a research project on top of a research project. :)

Feb 26 '17 20:02 dolio

@treeowl whats the best way to explore the join point cleanup in your mind?

Aug 05 '19 17:08 cartazio

Any thoughts on this recently? From the paper it sound very interesting:

Result: simpler code, less of it, and faster to execute. It’s a straight win

Would it be worthwhile to do a small prototype to see e.g. how much it improves runtime and/or compiler performance? Perhaps some of the zip or zipWith functions ? I picked those based on the following from the paper, but I'm not sure if it applies to zip for vectors:

Now the stepper function can say to update the state and call again, obviating the need for a loop of its own. This makes filter fusible, but it complicates everything else! Everything gets three cases instead of two, leading to more code and more runtime tests; and functions like zip that consume two lists become more complicated and less efficient.

Mar 21 '21 17:03 GeorgeCo

It is interesting, but it is also an enormous amount of work. @GeorgeCo You feel like working on this and submitting a PR with a prototype? If I were to work on this I would start with implementing and verifying performance of filter, because that is exactly what we need Skip for.

We already have a plan for putting all the streaming functionality that uses Stream and Step types into a separate package #355 So if this approach with relying on join points proves itself as feasible replacement for newer ghc we could then have two stream packages: one for older ghc with Skip and another package for newer ghc versions that doesn't have to rely on Skip for performance.

Mar 21 '21 18:03 lehins

I've just done a quick scan over the code. Now I'm stuck and it seems to me like it is impossible to remove Skip for monadic streams. Take for example the drop function. I've removed the Step constructor and replaced its usage by a recursive call:

drop :: Monad m => Int -> Stream m a -> Stream m a
{-# INLINE_FUSED drop #-}
drop n (Stream step t) = Stream step' (t, Just n)
  where
    {-# INLINE_INNER step' #-}
    step' (s, Just i) | i > 0 = do
                                  r <- step s
                                  case r of
                                     Yield _ s' -> step' (s', Just (i-1))
                                     Done       -> return Done

    step' (s, _) = liftM (\r ->
                     case r of
                       Yield x s' -> Yield x (s', Nothing)
                       Done       -> Done
                   ) (step s)

But here the step' (s', Just (i - 1)) recursive call is not guaranteed to be a tail call (that depends on what the Monad m is), so GHC can't mark step' as a join point in general (if I understand it correctly).

Or is there another way to write this so that it can use a join point?

Sep 26 '22 15:09 noughtmare

This is a very cool observation, would having a flavor of monads that guarantee they respect tail calls, in a way ghc understands, solve that?

I’ve hit a similar technical issue in some of my own prjects over time, though with different motivations. I’ll ask around

On Mon, Sep 26, 2022 at 11:07 AM Jaro Reinders @.***> wrote:

I've just done a quick scan over the code. It seems to me like it is impossible to remove Skip for monadic streams. Take for example the drop function. I've removed the Step constructor and replaced its usage by a recursive call:

drop :: Monad m => Int -> Stream m a -> Stream m a {-# INLINE_FUSED drop #-}drop n (Stream step t) = Stream step' (t, Just n) where {-# INLINE_INNER step' #-} step' (s, Just i) | i > 0 = do r <- step s case r of Yield _ s' -> step' (s', Just (i-1)) Done -> return Done
step' (s, _) = liftM (\r ->
                 case r of
                   Yield x s' -> Yield x (s', Nothing)
                   Done       -> Done
               ) (step s)
But here the step' (s', Just (i - 1)) recursive call is not guaranteed to be a tail call (that depends on what the Monad m is), so GHC can't mark step' as a join point in general (if I understand it correctly).

— Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-1258181207, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABBQXHQZAQDOBSHR2XCUTWAG33LANCNFSM4DA7DW2Q . You are receiving this because you commented.Message ID: @.***>

Sep 26 '22 16:09 cartazio

would having a flavor of monads that guarantee they respect tail calls, in a way ghc understands, solve that?

That's exactly what I was thinking!

I wonder what that looks like. I know somebody asked SPJ once after one of his talks if join points could be made explicit in the surface language and SPJ considered the possibility but didn't see an immediate use case.

I also think this is related to Alexis King's "Effects for Less" talk which mentions that it is really important that the monadic bind operator gets inlined.

Perhaps if we use her Eff monad instead of arbitrary monads in the streams then we would not have this issue.

Sep 26 '22 17:09 noughtmare

It might be worthwhile to draw SPJ and the other author of the paper into this conversation.

Sep 26 '22 17:09 GeorgeCo

ping @simonpj @pdownen @lukemaurer (I couldn't find Zena Ariola's github account if she even has one)

Sep 26 '22 17:09 noughtmare

Now I'm stuck and it seems to me like it is impossible to remove Skip for monadic streams. T

We are taking about fusion here, correct? That's only going to happen if you inine the function (in this case drop at the call site, else there is nothing to fuse with. And once you inline, you likely have fixed the monad.

I rather doubt that you'll achieve fusion for an arbitrary monad.

A good way to move this converation on would be to provide a concrete example of

a program where you think fusion should take place
the fused program that you think GHC should be able to reach by program transformations.

Being concrete aids comprehension, and saves misinderstandings.

Sep 26 '22 19:09 simonpj

Oh, that sounds right. Now I feel silly for missing that.

To be much more concrete. Take for example the Quickhull benchmark. Before my changes it takes 55.7 ms and after removing Skip it takes 72.2 ms. Looking at the core reveals a bunch of bindings like the following which I think are to blame for this discrepancy. These are not present in the core dumps from before my changes.

letrec {
  $s$wstep'_sebJ :: Int# -> Int# ->
    Id (Step (Int, Int, Maybe (Double, Double)) ((Double, Double), Double))
  $s$wstep'_sebJ
    = \ (sc_sebF :: Int#) (sc1_sebE :: Int#) ->
        case >=# sc1_sebE ww_sdOU of {
          __DEFAULT ->
            case indexDoubleArray# ww2_sdP0 (+# ww1_sdOY sc1_sebE) of wild2_a6SW { __DEFAULT ->
            case indexDoubleArray# ww4_sdP7 (+# ww3_sdP5 sc1_sebE) of wild1_Xe { __DEFAULT ->
            case >=# sc_sebF ww7_sdOz of {
              __DEFAULT ->
                case indexDoubleArray# ipv3_adv1 sc_sebF of wild3_Xh { __DEFAULT ->
                case >## wild3_Xh 0.0## of {
                  __DEFAULT -> $s$wstep'_sebJ (+# sc_sebF 1#) (+# sc1_sebE 1#); -- <<< recursive call here
                  1# ->
                    (Yield
                       ((D# wild2_a6SW, D# wild1_Xe), D# wild3_Xh)
                       (I# (+# sc1_sebE 1#), I# (+# sc_sebF 1#), Nothing))
                    `cast` <Co:15>
                }};
              1# -> Done `cast` <Co:15>
            }}};
          1# -> Done `cast` <Co:15>
        }; } in

Here are the full pre and post dumps of all the benchmarks: bench-dumps.zip

And the particular benchmarks that got noticeably worse are: listrank, awshcc, hybcc, quickhull, findIndexR, findIndexR_manual. Especially awshcc got more than 3x slower and hybcc about 2x slower.

Sep 26 '22 20:09 noughtmare

Maybe it is some kind of cross-function case of case optimization that is lacking. After reducing an example to just a V.map (* 2) . V.filter (> 10) I get this core:

letrec {
  $wstep'_s3Uf :: Int# -> Id (Step Int Double)
  $wstep'_s3Uf
    = \ (ww2_s3Ud :: Int#) ->
        case >=# ww2_s3Ud ipv1_s3Po of {
          __DEFAULT ->
            case indexDoubleArray# ipv2_s3Pp (+# ipv_s3Pn ww2_s3Ud)
            of wild2_i3d4
            { __DEFAULT ->
            case >## wild2_i3d4 10.0## of {
              __DEFAULT -> $wstep'_s3Uf (+# ww2_s3Ud 1#);
              1# ->
                (Yield (D# wild2_i3d4) (I# (+# ww2_s3Ud 1#))) `cast` <Co:5>
            }
            };
          1# -> Done `cast` <Co:5>
        }; } in
joinrec {
  $s$wfoldlM'_loop_s3WN
    :: State# RealWorld -> Int# -> Int# -> Vector Double
  $s$wfoldlM'_loop_s3WN (sc_s3WM :: State# RealWorld)
                        (sc1_s3WK :: Int#)
                        (sc2_s3WJ :: Int#)
    = case ($wstep'_s3Uf sc1_s3WK) `cast` <Co:4> of { -- <<< $wstep' can't be a join point
        Yield x_a3P4 s'_a3P5 ->
          case x_a3P4 of { D# x1_a2wq ->
          case writeDoubleArray#
                 ipv4_i3En
                 sc2_s3WJ
                 (+## x1_a2wq x1_a2wq)
                 (sc_s3WM `cast` <Co:5>)
          of s'#_i3S3
          { __DEFAULT ->
          case s'_a3P5 of { I# ww3_X4 ->
          jump $s$wfoldlM'_loop_s3WN
            (s'#_i3S3 `cast` <Co:4>) ww3_X4 (+# sc2_s3WJ 1#)
          }
          }
          };
        Done ->
          case unsafeFreezeByteArray# ipv4_i3En (sc_s3WM `cast` <Co:5>) of
          { (# ipv5_i3ci, ipv6_i3cj #) ->
          (Vector 0# sc2_s3WJ ipv6_i3cj) `cast` <Co:5>
          }
      }; } in
jump $s$wfoldlM'_loop_s3WN (ipv3_i3Em `cast` <Co:4>) 0# 0#

Maybe I'm running into https://gitlab.haskell.org/ghc/ghc/-/issues/16335 again?

Sep 26 '22 21:09 noughtmare

Maybe this is too hard, but could you distil out a standalone repro case?

Or (less good but still OK) supply precise repro instructions for your "reduced example". By precise I mean "check out this repo, cd to there, execute these commands..."

Looking at the code you show, I think that FloatIn should move the binding for $wstep' inwards, into the scrutinee of the case; there it will become a join point and good things should happen. I don't know why this doesn't happen.

Sep 26 '22 22:09 simonpj

Here's a standalone reproducer:

https://gist.github.com/noughtmare/f979c0fc9eed30abcd0849d19cf0f746

with the interesting core here:

https://gist.github.com/noughtmare/f979c0fc9eed30abcd0849d19cf0f746#file-all-dump-simpl-L2737-L2794

Commands you can run:

wget https://gist.githubusercontent.com/noughtmare/f979c0fc9eed30abcd0849d19cf0f746/raw/858db7da16a5ba542148d794655057bed33f88b3/All.hs
ghc -package ghc-prim -package primitive -O2 -ddump-simpl -dusppress-all -dno-suppress-type-signatures -ddump-to-file All.hs

Sep 27 '22 10:09 noughtmare

If I translate the produced core back to Haskell:

module T where

import Control.Monad.Primitive
import Data.Primitive.Array
import System.IO.Unsafe (unsafePerformIO)

data Step s a = Yield a s | Done

uninitialised = undefined

test :: Int -> Int -> Array Double -> (Int, Int, Array Double)
test off n oldArr = unsafePerformIO $ do
  newArr <- newArray n uninitialised
  let
    step' i
      | i >= n = Done
      | otherwise =
        let x = indexArray oldArr (off + i) in
        if x > 10
        then Yield x (i + 1)
        else step' (i + 1)
    loop i j = do
      case step' i of
        Yield x s' -> do
          writeArray newArr j (x + 1)
          loop s' (j + 1)
        Done -> do
          out <- unsafeFreezeArray newArr
          return (0, j, out)
  loop 0 0

And then compile that again I do get the nice core with two join points where the intermediate Yield and Done constructors are eliminated:

$wtest :: Int -> Int# -> Array Double -> (Int, Int, Array Double)
$wtest
  = \ (w :: Int) (ww :: Int#) (w1 :: Array Double) ->
      runRW#
        (\ (s :: State# RealWorld) ->
           case noDuplicate# s of s' { __DEFAULT ->
           case newArray# ww uninitialised (s' `cast` <Co:3>) of
           { (# ipv, ipv1 #) ->
           joinrec {
             $s$wloop
               :: State# RealWorld -> Int# -> Int# -> (Int, Int, Array Double)
             $s$wloop (sc :: State# RealWorld) (sc1 :: Int#) (sc2 :: Int#)
               = join {
                   $j :: (Int, Int, Array Double)
                   $j
                     = case unsafeFreezeArray# ipv1 (sc `cast` <Co:3>) of
                       { (# ipv2, ipv3 #) ->
                       lazy (test1, I# sc1, Array ipv3)
                       } } in
                 case >=# sc2 ww of {
                   __DEFAULT ->
                     case w of { I# x ->
                     case w1 of { Array ds2 ->
                     case indexArray# ds2 (+# x sc2) of { (# ipv2 #) ->
                     case ipv2 of { D# x1 ->
                     case >## x1 10.0## of {
                       __DEFAULT ->
                         joinrec {
                           $wstep' :: Int# -> (Int, Int, Array Double)
                           $wstep' (ww1 :: Int#)
                             = case >=# ww1 ww of {
                                 __DEFAULT ->
                                   case indexArray# ds2 (+# x ww1) of { (# ipv3 #) ->
                                   case ipv3 of { D# x2 ->
                                   case >## x2 10.0## of {
                                     __DEFAULT -> jump $wstep' (+# ww1 1#);
                                     1# ->
                                       case writeArray#
                                              ipv1 sc1 (D# (+## x2 1.0##)) (sc `cast` <Co:3>)
                                       of s'#
                                       { __DEFAULT ->
                                       jump $s$wloop (s'# `cast` <Co:2>) (+# sc1 1#) (+# ww1 1#)
                                       }
                                   } } };
                                 1# -> jump $j
                               }; } in
                         jump $wstep' (+# sc2 1#);
                       1# ->
                         case writeArray# ipv1 sc1 (D# (+## x1 1.0##)) (sc `cast` <Co:3>)
                         of s'#
                         { __DEFAULT ->
                         jump $s$wloop (s'# `cast` <Co:2>) (+# sc1 1#) (+# sc2 1#)
                         }
                     } } } } };
                   1# -> jump $j
                 }; } in
           jump $s$wloop (ipv `cast` <Co:2>) 0# 0#
           } })

So is the simplifier just running out of steam?

Sep 27 '22 14:09 noughtmare

Hi Jaro

Good progress on this! wrt

So is the simplifier just running out of steam?

Did you try any of the increasing the count of any of the simplifier options in https://downloads.haskell.org/ghc/latest/docs/users_guide/using-optimisation.html ?

Unfortunately I am very far from being an expert but the one that looked most likely to be was the following and the associated -fddump-simpl-stats

-fsimpl-tick-factor=⟨n⟩ Default: 100

GHC’s optimiser can diverge if you write rewrite rules (Rewrite rules https://downloads.haskell.org/ghc/latest/docs/users_guide/exts/rewrite_rules.html#rewrite-rules) that don’t terminate, or (less satisfactorily) if you code up recursion through data types (Bugs in GHC https://downloads.haskell.org/ghc/latest/docs/users_guide/bugs.html#bugs-ghc). To avoid making the compiler fall into an infinite loop, the optimiser carries a “tick count” and stops inlining and applying rewrite rules when this count is exceeded. The limit is set as a multiple of the program size, so bigger programs get more ticks. The -fsimpl-tick-factor flag lets you change the multiplier. The default is 100; numbers larger than 100 give more ticks, and numbers smaller than 100 give fewer.

If the tick-count expires, GHC summarises what simplifier steps it has done; you can use -fddump-simpl-stats to generate a much more detailed list. Usually that identifies the loop quite accurately, because some numbers are very large.

https://downloads.haskell.org/ghc/latest/docs/users_guide/using-optimisation.html#ghc-flag--fsimpl-tick-factor=%E2%9F%A8n%E2%9F%A9

On Tue, Sep 27, 2022 at 11:17 AM Jaro Reinders @.***> wrote:

If I translate the produced core back to Haskell:

module T where import Control.Monad.Primitiveimport Data.Primitive.Arrayimport System.IO.Unsafe (unsafePerformIO) data Step s a = Yield a s | Done

uninitialised = undefined test :: Int -> Int -> Array Double -> (Int, Int, Array Double) test off n oldArr = unsafePerformIO $ do newArr <- newArray n uninitialised let step' i | i >= n = Done | otherwise = let x = indexArray oldArr (off + i) in if x > 10 then Yield x (i + 1) else step' (i + 1) loop i sc2 = do let c = step' i case c of Yield x s' -> do writeArray newArr sc2 (x + 1) loop s' (sc2 + 1) Done -> do out <- unsafeFreezeArray newArr return (0, sc2, out) loop 0 0

And then compile that again I do get the nice core with two join points where the intermediate Yield and Done constructors are eliminated:

$wtest :: Int -> Int# -> Array Double -> (Int, Int, Array Double)$wtest = \ (w :: Int) (ww :: Int#) (w1 :: Array Double) -> runRW# (\ (s :: State# RealWorld) -> case noDuplicate# s of s' { __DEFAULT -> case newArray# ww uninitialised (s' cast Co:3) of { (# ipv, ipv1 #) -> joinrec { $s$wloop :: State# RealWorld -> Int# -> Int# -> (Int, Int, Array Double) $s$wloop (sc :: State# RealWorld) (sc1 :: Int#) (sc2 :: Int#) = join { $j :: (Int, Int, Array Double) $j = case unsafeFreezeArray# ipv1 (sc cast Co:3) of { (# ipv2, ipv3 #) -> lazy (test1, I# sc1, Array ipv3) } } in case >=# sc2 ww of { __DEFAULT -> case w of { I# x -> case w1 of { Array ds2 -> case indexArray# ds2 (+# x sc2) of { (# ipv2 #) -> case ipv2 of { D# x1 -> case >## x1 10.0## of { __DEFAULT -> joinrec { $wstep' :: Int# -> (Int, Int, Array Double) $wstep' (ww1 :: Int#) = case >=# ww1 ww of { __DEFAULT -> case indexArray# ds2 (+# x ww1) of { (# ipv3 #) -> case ipv3 of { D# x2 -> case >## x2 10.0## of { __DEFAULT -> jump $wstep' (+# ww1 1#); 1# -> case writeArray# ipv1 sc1 (D# (+## x2 1.0##)) (sc cast Co:3) of s'# { __DEFAULT -> jump $s$wloop (s'# cast Co:2) (+# sc1 1#) (+# ww1 1#) } } } }; 1# -> jump $j }; } in jump $wstep' (+# sc2 1#); 1# -> case writeArray# ipv1 sc1 (D# (+## x1 1.0##)) (sc cast Co:3) of s'# { __DEFAULT -> jump $s$wloop (s'# cast Co:2) (+# sc1 1#) (+# sc2 1#) } } } } } }; 1# -> jump $j }; } in jump $s$wloop (ipv cast Co:2) 0# 0# } })

So is the simplifier just running out of steam?

— Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-1259576807, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQIJ65QW6AXKBWUDW4Q7ODWAL6XHANCNFSM4DA7DW2Q . You are receiving this because you were mentioned.Message ID: @.***>

Sep 27 '22 14:09 GeorgeCo

I did try increasing the max simplifier iterations which did not help.

But now I've rewritten it using the lower level unboxed operations:

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module T where

import GHC.Exts
import GHC.IO

data Step s a = Yield a s | Done

uninitialised = undefined

test :: Int# -> Int# -> Array# Double -> (# Int#, Int#, Array# Double #)
test off n oldArr = runRW# $ \s0 ->
  case newArray# n uninitialised s0
   of { (# s1, newArr #) ->
  let
    step' i
      | isTrue# (i >=# n) = Done
      | otherwise =
        let (# D# x #) = indexArray# oldArr (off +# i) in
        if isTrue# (x >## 10.0##)
        then Yield (D# x) (I# (i +# 1#))
        else step' (i +# 1#)
    loop i j s2 =
      case step' i of
        Yield x (I# s') ->
          case writeArray# newArr j (x + 1) s2
           of { s3 -> 
          loop s' (j +# 1#) s3
        }
        Done ->
          case unsafeFreezeArray# newArr s2
           of { (# s3, out #) ->
          (# 0#, j, out #)
        }
  in
  loop 0# 0# s1
  }

And compiling this program does show the optimisation failure:

test
  = \ off n oldArr ->
      runRW#
        (\ s0 ->
           case newArray# n uninitialised s0 of { (# ipv, ipv1 #) ->
           letrec {
             step'
               = \ i ->
                   case >=# i n of {
                     __DEFAULT ->
                       case indexArray# oldArr (+# off i) of { (# ipv2 #) ->
                       case ipv2 of wild { D# x ->
                       case >## x 10.0## of {
                         __DEFAULT -> step' (+# i 1#);
                         1# -> Yield wild (I# (+# i 1#))
                       }
                       }
                       };
                     1# -> Done
                   }; } in
           join {
             exit j s2
               = case unsafeFreezeArray# ipv1 s2 of { (# ipv2, ipv3 #) ->
                 (# 0#, j, ipv3 #)
                 } } in
           joinrec {
             loop i j s2
               = case step' i of {
                   Yield x ds1 ->
                     case ds1 of { I# s' ->
                     case writeArray#
                            ipv1 j (case x of { D# x1 -> D# (+## x1 1.0##) }) s2
                     of s3
                     { __DEFAULT ->
                     jump loop s' (+# j 1#) s3
                     }
                     };
                   Done -> jump exit j s2
                 }; } in
           jump loop 0# 0# ipv
           })

Sep 27 '22 14:09 noughtmare

I think it is time to open a proper GHC issue for this, so I've done that: https://gitlab.haskell.org/ghc/ghc/-/issues/22227

Sep 27 '22 15:09 noughtmare

this is great :)

On Tue, Sep 27, 2022 at 11:08 AM Jaro Reinders @.***> wrote:

I think it is time to open a proper GHC issue for this, so I've done that: https://gitlab.haskell.org/ghc/ghc/-/issues/22227

— Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-1259647533, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABBQVAEPOXFMAY4EJVU4DWAMEWBANCNFSM4DA7DW2Q . You are receiving this because you commented.Message ID: @.***>

Sep 27 '22 17:09 cartazio

Good news! I have been able to manually implement the "loopification" optimization that GHC is missing (see this WIP merge request of GHC) with https://github.com/haskell/vector/pull/448/commits/cfaf0d4b0102b294784378bc3d77e6c7ceb2407f. Now almost all benchmarks are the same or faster (tridiag speeds up the most from 86ms to 19.3ms) except awshcc which is still about 2x slower (406ms to 872ms). I'll look into that now. For now, I can already say is that it is able to fuse properly so the problem must be something else.

Nov 01 '22 15:11 noughtmare

Wow !

Nov 01 '22 15:11 GeorgeCo

Would love to see the table of timing changes

Nov 01 '22 19:11 cartazio

I've now reached the point that all benchmarks are (approximately) equal or faster, see https://github.com/haskell/vector/pull/448#issuecomment-1299110013.

Nov 01 '22 20:11 noughtmare

@noughtmare That is all pretty great news.

If I understand correctly the proper speed up is only observed with the ghc patch

How does the regression look for all the older ghc versions. We do support all the way down to ghc-8.0

Nov 01 '22 21:11 lehins

@lehins

If I understand correctly the proper speed up is only observed with the ghc patch

No, I've manually implemented the optimization that GHC would do with that patch.

For now I've tested with GHC 9.2.4. I'll do some more tests with other GHC versions, but I believe this only really requires join points (I don't know when those were introduced, but I think GHC 8 already had them).

Nov 01 '22 21:11 noughtmare

Now I understand, that is great! Awesome work!

Let's make sure we don't mess up performance for anyone and I don't see a reason why we shouldn't get this change into vector.

Nov 01 '22 21:11 lehins

vector vector copied to clipboard

Consider design change in light of GHC join point work

vector
vector copied to clipboard