minimal but convincing example?
I would very much like the documentation to contain a minimal convincing example (and for it to be included in the tests, so we can be sure that it works).
"Convincing" in the sense that
- running with
+RTS -N<k>(for small k) visibly, and progressively, reduces execution time - code is readable and stand-alone (no fancy extra libraries)
For reference, I am using the following C# example (sum of bitcounts) in teaching and I like that it has these properties:
- it's two one-liners (a naive bitcount implementation, a naive summation)
- parallelisation is trivial (add `.AsParallel()`)
- the effect is visible (it cuts execution time nearly in half)
- it works out-of-the-box in the `csharp` REPL
```
Func<int,int> bc = (int x) => { int c=0; while (x>0) { c += x%2 ; x >>= 1 ; } return c; }
Time(() => Console.WriteLine( Enumerable.Range(0,1<<27).Select(bc).Sum()))
1811939328
00:00:03.5504990
Time(() => Console.WriteLine( Enumerable.Range(0,1<<27).AsParallel().Select(bc).Sum()))
1811939328
00:00:02.1616790
```
If there is such an example (my naive use of parListChunk does not seem to cut it) I'm happy to write it up as haddock and submit a PR.
NB - Sure I know (and discuss with students) that there are better implementations of bc, and that sum-of-bitcounts (up to powers of two) has a closed form, so we don't actually need any of this...
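For concreteness, a direct Haskell transliteration of the bitcount example might look like this (just a sketch: the chunk size `2^20` is an arbitrary guess of mine, and, as said, my naive `parListChunk` attempts did not show the speedups I'd like):

```haskell
import Control.Parallel.Strategies (parListChunk, rseq, withStrategy)

-- naive bitcount, mirroring the C# one-liner
bc :: Int -> Int
bc = go 0
  where
    go c 0 = c
    go c x = go (c + x `mod` 2) (x `div` 2)

main :: IO ()
main = print
     $ sum
     $ withStrategy (parListChunk (2 ^ 20) rseq)  -- chunk size: arbitrary guess
     $ map bc [0 .. 2 ^ 27 - 1]
```

Compiled with `ghc -O2 -threaded -rtsopts` and run with `+RTS -N<k>`, this should print 1811939328 (each of the bit positions 0..26 is set in exactly 2^26 of the numbers, so the sum is 27 * 2^26), matching the C# runs above; whether the speedup is convincing is exactly the open question.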
```
‰ time ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e 'fib x = product [1..x]' -e 'deepseq (parMap rdeepseq fib $ replicate 5 200000) ()' +RTS -N
()
ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e -  36.62s user 5.15s system 388% cpu 10.743 total
‰ time ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e 'fib x = product [1..x]' -e 'deepseq (map fib $ replicate 5 200000) ()' +RTS -N
()
ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e -  33.28s user 6.57s system 165% cpu 24.062 total
```
yes, nice. (fib = factorial?)
I have to specify (reduce) the number of capabilities (my machine has too many cores?)
```
$ time ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e 'fib x = product [1..x]' -e 'deepseq (parMap rdeepseq fib $ replicate 5 200000) ()' +RTS -N6
()
real    0m6.178s
user    0m20.932s
sys     0m1.761s
$ time ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e 'fib x = product [1..x]' -e 'deepseq (map fib $ replicate 5 200000) ()' +RTS -N6
()
real    0m12.747s
user    0m13.292s
sys     0m1.911s
```
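For reference, the same experiment as a standalone program, so it can be compiled instead of run through `ghc -e` (the file name is mine; everything else is the one-liner above):

```haskell
-- ParFac.hs: the `ghc -e` one-liner above, as a compilable program
import Control.DeepSeq (deepseq)
import Control.Parallel.Strategies (parMap, rdeepseq)

fib :: Integer -> Integer   -- really factorial, as noted
fib x = product [1 .. x]

main :: IO ()
main = parMap rdeepseq fib (replicate 5 200000) `deepseq` print ()
```

Compile with `ghc -O2 -threaded -rtsopts ParFac.hs` and run with `./ParFac +RTS -N6` to repeat the timing comparison without ghc's own startup overhead in the measurement.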
You are throwing away the result (I understand: otherwise we'd be measuring the time for printing); my example above has the advantage that we can see, and compare, the results.
And ... the rabbit holes ... (why is the time only cut in half, where does the extra system time come from, why does this allocate at all, etc., etc.)
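One way to keep the results visible and comparable without paying for printing: force the full factorial inside each spark, but print only a small digest of each result (a sketch; the modulus 1000003 is an arbitrary choice of mine):

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)

fib :: Integer -> Integer   -- the factorial from the one-liner above
fib x = product [1 .. x]

main :: IO ()
main = mapM_ print
     -- each spark computes the full (roughly million-digit) factorial,
     -- but only its residue is printed, so output cost stays negligible
     $ parMap rdeepseq ((`mod` 1000003) . fib)
     $ replicate 5 200000
```

This way identical residues across runs give at least some evidence that the parallel and sequential versions compute the same thing.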
perhaps this? a slightly silly way of repeatedly computing factorial of 10
```haskell
import Control.Parallel.Strategies
import Data.List (permutations)

main = print
  $ withStrategy (parList rseq)
  $ map (length . permutations . take 10 . enumFrom) [1 .. 8]
```
or, with ghci:
```
$ ghci +RTS -N8 -A1m
GHCi, version 9.2.4: https://www.haskell.org/ghc/  :? for help
ghci> import Control.Parallel.Strategies
ghci> import Data.List (permutations)
ghci> :set +s
ghci> map (length . permutations . take 10 . enumFrom) [1 .. 8]
[3628800,3628800,3628800,3628800,3628800,3628800,3628800,3628800]
(3.12 secs, 8,737,949,784 bytes)
ghci> withStrategy (parList rseq) $ map (length . permutations . take 10 . enumFrom) [1 .. 8]
[3628800,3628800,3628800,3628800,3628800,3628800,3628800,3628800]
(1.28 secs, 1,092,361,960 bytes)
```
If we use `-N4` when starting ghci, we see the first four list elements appear at the same time, then the next four. Or we can do:
```
ghci> setNumCapabilities 1
(0.00 secs, 34,296 bytes)
ghci> withStrategy (parList rseq) $ map (length . permutations . take 10 . enumFrom) [1 .. 8]
[3628800,3628800,3628800,3628800,3628800,3628800,3628800,3628800]
(2.80 secs, 8,737,948,216 bytes)
ghci> setNumCapabilities 2
(0.00 secs, 34,304 bytes)
ghci> withStrategy (parList rseq) $ map (length . permutations . take 10 . enumFrom) [1 .. 8]
[3628800,3628800,3628800,3628800,3628800,3628800,3628800,3628800]
(1.69 secs, 1,092,362,304 bytes)
```
warning - more rabbit holes: https://gitlab.haskell.org/ghc/ghc/-/issues/20783
I guess the problem with such examples in ghci is:
- more allocations => more work for the RTS =?=> less parallelisable
- the bytecode produced by ghci is not optimised (enough)? => it is hard to program a non-allocating workload
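Regarding the bytecode point: one thing that might be worth trying (I have not measured it, and it only affects loaded modules, not expressions typed at the prompt) is asking ghci to compile modules to optimised object code:

```
$ ghci +RTS -N8 -A1m
ghci> :set -fobject-code -O2
ghci> :load Example.hs   -- a hypothetical file containing the example above
```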
Your factorial example is nice because it mostly (?) spends its time in GMP (long multiplication). In my example, the work is done in Data.List.permutations, a library module, for which ghci loads (optimised) object code rather than bytecode? But then it's a "fancy extra library" (see the top post) ...