minimal but convincing example?
I would very much like the documentation to contain a minimal convincing example (and for it to be included in the tests, so we can be sure that it works).
"Convincing" in the sense that
- running with
+RTS -N<k>(for small k) visibly, and progressively, reduces execution time - code is readable and stand-alone (no fancy extra libraries)
For reference, I am using the following C# example (sum of bitcounts) in teaching and I like that it has these properties:
- it's two one-liners (a naive bitcount implementation, a naive summation)
- parallelisation is trivial (add `.AsParallel()`)
- the effect is visible (it cuts execution time nearly in half)
- it works out-of-the-box in the `csharp` REPL
```
Func<int,int> bc = (int x) => { int c=0; while (x>0) { c += x%2 ; x >>= 1 ; } return c; }
Time(() => Console.WriteLine( Enumerable.Range(0,1<<27).Select(bc).Sum()))
1811939328
00:00:03.5504990
Time(() => Console.WriteLine( Enumerable.Range(0,1<<27).AsParallel().Select(bc).Sum()))
1811939328
00:00:02.1616790
```
If there is such an example (my naive use of parListChunk does not seem to cut it) I'm happy to write it up as haddock and submit a PR.
NB - Sure I know (and discuss with students) that there are better implementations of bc, and that sum-of-bitcounts (up to powers of two) has a closed form, so we don't actually need any of this...
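For concreteness, a direct Haskell transliteration of the bitcount example might look like this (just a sketch: the chunk size `2^20` is an arbitrary guess of mine, and, as said, my naive `parListChunk` attempts did not show the speedups I'd like):

```haskell
import Control.Parallel.Strategies (parListChunk, rseq, withStrategy)

-- naive bitcount, mirroring the C# one-liner
bc :: Int -> Int
bc = go 0
  where
    go c 0 = c
    go c x = go (c + x `mod` 2) (x `div` 2)

main :: IO ()
main = print
     $ sum
     $ withStrategy (parListChunk (2 ^ 20) rseq)  -- chunk size: arbitrary guess
     $ map bc [0 .. 2 ^ 27 - 1]
```

Compiled with `ghc -O2 -threaded -rtsopts` and run with `+RTS -N<k>`, this should print 1811939328 (each of the bit positions 0..26 is set in exactly 2^26 of the numbers, so the sum is 27 * 2^26), matching the C# runs above; whether the speedup is convincing is exactly the open question.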
```
‰ time ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e 'fib x = product [1..x]' -e 'deepseq (parMap rdeepseq fib $ replicate 5 200000) ()' +RTS -N
()
ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e -  36.62s user 5.15s system 388% cpu 10.743 total
‰ time ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e 'fib x = product [1..x]' -e 'deepseq (map fib $ replicate 5 200000) ()' +RTS -N
()
ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e -  33.28s user 6.57s system 165% cpu 24.062 total
```
yes, nice. (fib = factorial?)
I have to specify (reduce) the number of capabilities (my machine has too many cores?)
```
$ time ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e 'fib x = product [1..x]' -e 'deepseq (parMap rdeepseq fib $ replicate 5 200000) ()' +RTS -N6
()
real    0m6.178s
user    0m20.932s
sys     0m1.761s
$ time ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e 'fib x = product [1..x]' -e 'deepseq (map fib $ replicate 5 200000) ()' +RTS -N6
()
real    0m12.747s
user    0m13.292s
sys     0m1.911s
```
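For reference, the same experiment as a standalone program, so it can be compiled instead of run through `ghc -e` (the file name is mine; everything else is the one-liner above):

```haskell
-- ParFac.hs: the `ghc -e` one-liner above, as a compilable program
import Control.DeepSeq (deepseq)
import Control.Parallel.Strategies (parMap, rdeepseq)

fib :: Integer -> Integer   -- really factorial, as noted
fib x = product [1 .. x]

main :: IO ()
main = parMap rdeepseq fib (replicate 5 200000) `deepseq` print ()
```

Compile with `ghc -O2 -threaded -rtsopts ParFac.hs` and run with `./ParFac +RTS -N6` to repeat the timing comparison without ghc's own startup overhead in the measurement.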
You are throwing away the result (I understand: otherwise we'd be measuring the time for printing); my example above has the advantage that we can see, and compare, the results.
And ... the rabbit holes ... (why is the time only cut in half, where does the extra system time come from, why does this allocate at all, etc., etc.)
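One way to keep the results visible and comparable without paying for printing: force the full factorial inside each spark, but print only a small digest of each result (a sketch; the modulus 1000003 is an arbitrary choice of mine):

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)

fib :: Integer -> Integer   -- the factorial from the one-liner above
fib x = product [1 .. x]

main :: IO ()
main = mapM_ print
     -- each spark computes the full (roughly million-digit) factorial,
     -- but only its residue is printed, so output cost stays negligible
     $ parMap rdeepseq ((`mod` 1000003) . fib)
     $ replicate 5 200000
```

This way identical residues across runs give at least some evidence that the parallel and sequential versions compute the same thing.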
perhaps this? a slightly silly way of repeatedly computing factorial of 10
```haskell
import Control.Parallel.Strategies
import Data.List (permutations)

main = print
  $ withStrategy (parList rseq)
  $ map (length . permutations . take 10 . enumFrom) [1 .. 8]
```
or, with ghci:
```
$ ghci +RTS -N8 -A1m
GHCi, version 9.2.4: https://www.haskell.org/ghc/  :? for help
ghci> import Control.Parallel.Strategies
ghci> import Data.List (permutations)
ghci> :set +s
ghci> map (length . permutations . take 10 . enumFrom) [1 .. 8]
[3628800,3628800,3628800,3628800,3628800,3628800,3628800,3628800]
(3.12 secs, 8,737,949,784 bytes)
ghci> withStrategy (parList rseq) $ map (length . permutations . take 10 . enumFrom) [1 .. 8]
[3628800,3628800,3628800,3628800,3628800,3628800,3628800,3628800]
(1.28 secs, 1,092,361,960 bytes)
```
If we use `-N4` when starting ghci, we see the first four list elements appear at the same time, then the next four. Or we can do:
```
ghci> setNumCapabilities 1
(0.00 secs, 34,296 bytes)
ghci> withStrategy (parList rseq) $ map (length . permutations . take 10 . enumFrom) [1 .. 8]
[3628800,3628800,3628800,3628800,3628800,3628800,3628800,3628800]
(2.80 secs, 8,737,948,216 bytes)
ghci> setNumCapabilities 2
(0.00 secs, 34,304 bytes)
ghci> withStrategy (parList rseq) $ map (length . permutations . take 10 . enumFrom) [1 .. 8]
[3628800,3628800,3628800,3628800,3628800,3628800,3628800,3628800]
(1.69 secs, 1,092,362,304 bytes)
```
warning - more rabbit holes: https://gitlab.haskell.org/ghc/ghc/-/issues/20783
I guess the problem with such examples in ghci is:
- more allocations => more work for the RTS =?=> less parallelisable
- the bytecode produced by ghci is not optimised (enough)? => it is hard to program a non-allocating workload
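Regarding the bytecode point: one thing that might be worth trying (I have not measured it, and it only affects loaded modules, not expressions typed at the prompt) is asking ghci to compile modules to optimised object code:

```
$ ghci +RTS -N8 -A1m
ghci> :set -fobject-code -O2
ghci> :load Example.hs   -- a hypothetical file containing the example above
```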
Your factorial example is nice because it mostly (?) spends its time in GMP (long multiplication). In my example, the work is done in Data.List.permutations, a library module, for which ghci loads (optimised) object code rather than bytecode? But then it's a "fancy extra library" (see the top post) ...