tonyday567/perf: high-resolution performance benchmarking

trafficstars

#+TITLE: perf #+PROPERTY: header-args :exports both #+PROPERTY: header-args :eval no-export

[[https://hackage.haskell.org/package/perf][file:https://img.shields.io/hackage/v/perf.svg]] [[https://github.com/tonyday567/perf/actions?query=workflow%3Ahaskell-ci][file:https://github.com/tonyday567/perf/workflows/haskell-ci/badge.svg]]

Introduction

~perf~ provides some ideas, code and a library for low-level performance measurement for Haskell hacking. The library:

provides a monad transformer, ~PerfT~, as a light-weight wrapper for use on existing code. ~PerfT~ modifications can be included in code bases, as opposed to performance being separated code and process, with any effects able to be erased at compile time with ~evalPerfT~.
focuses on using the [[https://en.wikipedia.org/wiki/Time_Stamp_Counter][rdtsc]] instruction set that comes with the x86 chip to measure accurate time performance as number of clock cycles.
abstracts the concept of what is being measured, so that concepts such as counters, debug checks, time and space performance can be treated equivalently.

Setup

Note that running this readme.org is very slow compared with an external process which accesses the compiled version of the library.

#+begin_src haskell-ng :results output :r :set -Wno-type-defaults :set -Wno-unused-do-bind :set -Wno-name-shadowing :set -XOverloadedStrings :set -XOverloadedLabels import Perf import Data.FormatN import qualified Data.Text as Text import qualified Data.Text.IO as Text import qualified Data.Map.Strict as Map import Control.Monad import Data.Bifunctor putStrLn "ok" #+end_src

#+RESULTS: #+begin_example Build profile: -w ghc-9.8.1 -O1 In order, the following will be built (use -v for more details):

perf-0.12.0.1 (lib) (first run) Preprocessing library for perf-0.12.0.1.. GHCi, version 9.8.1: https://www.haskell.org/ghc/ :? for help Loaded GHCi configuration from /Users/tonyday/haskell/perf/.ghci [ 1 of 10] Compiling Perf.Stats ( src/Perf/Stats.hs, interpreted ) [ 2 of 10] Compiling Perf.Types ( src/Perf/Types.hs, interpreted ) [ 3 of 10] Compiling Perf.Time ( src/Perf/Time.hs, interpreted ) [ 4 of 10] Compiling Perf.Space ( src/Perf/Space.hs, interpreted ) [ 5 of 10] Compiling Perf.Count ( src/Perf/Count.hs, interpreted ) [ 6 of 10] Compiling Perf.Measure ( src/Perf/Measure.hs, interpreted ) [ 7 of 10] Compiling Perf.Report ( src/Perf/Report.hs, interpreted ) [ 8 of 10] Compiling Perf.BigO ( src/Perf/BigO.hs, interpreted ) [ 9 of 10] Compiling Perf.Algos ( src/Perf/Algos.hs, interpreted ) [10 of 10] Compiling Perf ( src/Perf.hs, interpreted ) Ok, 10 modules loaded. Ok, 10 modules loaded. ok #+end_example

Time ** What is a tick?

The fundamental operation of Perf.Time is tick, which sandwiches a (strict) function application between two readings of the rdstc chip.

#+begin_src haskell-ng :results output :exports both :t tick #+end_src

#+RESULTS: : tick :: (a -> b) -> a -> IO (Cycles, b)

tick returns in the IO monad, because reading a cycle counter is an IO effect. The trivial but fundamental point is that performance measurement effects the computation being measured.

Unlike benchmarking-style libraries, such as criterion, the result of the computation is passed through along with the cycle count. In this way, the Perf monad can be inserted into the midst of a computation in an attempt to measure performance in-situ as opposed to sitting off in a separate and perhaps unrepresentative process.

** tick_

tick_ measures the number of cycles between two clock reads.

#+begin_src haskell-ng :results output :exports both :t tick_ #+end_src

#+RESULTS: : tick_ :: IO Cycles

#+begin_src haskell-ng :results output :exports both fmap word <$> replicateM 10 tick_ #+end_src

#+RESULTS: : [3063,1201,990,978,945,948,951,948,951,948]

** multiple ticks

#+begin_src haskell-ng :results output :exports both fmap (fmap (word . fst)) . replicateM 10 $ tick (const ()) () #+end_src

#+RESULTS: : [12104,5040,4868,4508,4188,4172,4800,4748,5760,4216]

Here, ~const () ()~ was evaluated and took 6.1k cycles for the first effect, reducing down to 2.6k after 10 effects. What it did in that time we will never know, really, at least from usage of ~perf~. 2.2k cycles is, on my 2.5GHz machine equal to 2200 / 2.5e9 = 8.8e-7 of a second or 880 nanoseconds.

** tickIO

~tickIO~ measures the evaluation of an IO value.

#+begin_src haskell-ng :results output :exports both :t tickIO #+end_src

#+RESULTS: : tickIO :: IO a -> IO (Cycles, a)

#+begin_src haskell-ng :results output :exports both fmap (fmap (word . fst)) . replicateM 10 $ tickIO (pure ()) #+end_src

#+RESULTS: : [10084,3588,4104,2844,2916,3140,2800,2848,2848,2804]

** sum example

#+begin_src haskell-ng :exports both fmap (expt (Just 2) . fromIntegral) . fst <$> ticks 10 sum ([1..10000] :: [Double]) #+end_src

#+RESULTS: : ["1.7e6","6.1e5","6.2e5","5.4e5","4.3e5","4.1e5","4.1e5","4.3e5","4.1e5","4.1e5"]

#+begin_src haskell-ng :results output :exports both ts <- ticks 10000 sum ([1..1000] :: [Double]) print $ average (fmap fromIntegral $ fst ts) #+end_src

#+RESULTS: : 30411.6957

PerfT

~PerfT~ allows for multiple measurement points and is polymorphic in what is being measured. It returns a Map of results held in State.

Compare a lower-level usage of ticks, measuring the average of summing to one thousand over one thousand trials:

#+begin_src haskell-ng :results output :exports both first (average . fmap fromIntegral) <$> ticks 1000 sum [1..1000] #+end_src

#+RESULTS: : (50914.804,500500)

... with PerfT usage

#+begin_src haskell-ng :results output :exports both second (fmap (average . fmap fromIntegral)) <$> runPerfT (times 1000) (sum |$| [1..1000]) #+end_src

#+RESULTS: : (500500,fromList [("",44362.311)])

Comparing performance of sum versus a list fusion approach:

#+begin_src haskell-ng :results output :exports both fmap (average . fmap fromIntegral) <$> (execPerfT (times 1000) $ do; (fap "sum" sum [1..1000]); (fap "fusion" (\x -> sum [1..x]) 1000)) #+end_src

#+RESULTS: : fromList [("fusion",97804.085),("sum",45053.771)]

An IO example

#+begin_src haskell-ng exampleIO' :: IO () exampleIO' = do txt <- Text.readFile "src/Perf.hs" let n = Text.length txt Text.putStrLn $ "length of file is: " <> Text.pack (show n) #+end_src

#+begin_src haskell-ng :results output :exports both exampleIO = execPerfT time (do txt <- fam "file_read" (Text.readFile "src/Perf.hs") n <- fap "length" Text.length txt fam "print_result" (Text.putStrLn $ "length of file is: " <> Text.pack (show n))) #+end_src

#+begin_src sh :results output :exports both perf-explore --exampleIO #+end_src

#+RESULTS: #+begin_example length of file is: 1794 length of file is: 1794

label1 label2 label3 old result new result change

normal file-read time 5.96e5 6.58e5 slightly-degraded normal length time 1.10e4 7.12e3 improvement normal print-result time 1.44e5 5.70e4 improvement outer file-read time 1.36e5 1.39e5 outer length time 1.45e3 1.33e3 improvement outer outer-total time 1.67e5 1.50e5 improvement outer print-result time 2.70e4 6.26e3 improvement #+end_example

perf-explore

~perf-explore~ contains some exploratory routines used to develop =perf=

#+begin_src sh :results output :exports both perf-explore --help #+end_src

#+RESULTS: #+begin_example examples of perf usage

perf exploration

Available options: -n,--runs ARG number of runs to perform --best report upper decile --median report median --average report average --averagesecs report average in seconds --time measure time performance --space measure space performance --spacetime measure both space and time performance --allocation measure bytes allocated --count measure count -g,--golden ARG golden file name --nocheck do not check versus the golden file -r,--record record the result to the golden file --header include headers --noheader dont include headers --error ARG error level --warning ARG warning level --improved ARG improved level --sums run on sum algorithms --lengths run on length algorithms --nub nub test --examples run on example algorithms --example run on the example algorithm --exampleIO exampleIO test --noops noops test --ticks tick test --gauge gauge runs on exmaple for comparison -l,--length ARG length of list --sumFuse fused sum pipeline --sum sum --lengthF foldr id length --constFuse fused const pipeline --mapInc fmap (+1) --noOp const () -h,--help Show this help text #+end_example

#+begin_src haskell-ng :results output :exports both fmap averageI <$> execPerfT (times 10000) (sum |$| [1..1000]) #+end_src

#+RESULTS: : fromList [("",35729.9518)]

The equivalent to the above code is:

#+begin_src sh :results output :exports both perf-explore -n 10000 -l 1000 --sum --nocheck #+end_src

#+RESULTS: : label1 label2 results : : sum time 1.47e4

** noops

This no-op experiment is useful to understand the pure time performance of the machinery around measurement. It can be (re)run with:

#+begin_src sh :results output :exports both perf-explore --noops #+end_src

#+RESULTS: #+begin_example label1 label2 label3 results

const 1st time 6.05e3 const 2nd time 2.80e1 const 3rd time 3.60e1 const 4th time 3.20e1 const average time 4.11e1 const best time 2.74e1 const median time 3.54e1 pure 1st time 6.46e3 pure 2nd time 7.60e1 pure 3rd time 4.00e1 pure 4th time 4.00e1 pure average time 4.28e1 pure best time 2.82e1 pure median time 3.64e1 No golden file found. To create one, run with '-r' #+end_example

** measurement context

Exploration of how the code surrounding measurement effects performance.

#+begin_src sh :results drawer :exports both perf-explore -n 1000 -l 1000 --ticks --nocheck #+end_src

#+RESULTS: :results: | | stepTime | | sumAux | 1.46e4 | 1.51e4 | | sumCata | 1.45e4 | 1.82e4 | | sumCo | 1.45e4 | 1.46e4 | | sumCoCase | 1.44e4 | 1.51e4 | | sumCoGo | 2.61e4 | 1.46e4 | | sumF | 1.34e4 | 1.34e4 | | sumFlip | 1.37e4 | 1.34e4 | | sumFlipLazy | 1.33e4 | 1.34e4 | | sumFoldr | 1.44e4 | 1.45e4 | | sumFuse | 3.19e3 | 1.98e3 | | sumFuseFoldl' | 1.88e3 | 2.46e3 | | sumFuseFoldr | 5.79e3 | 6.01e3 | | sumFusePoly | 2.67e3 | 2.68e3 | | sumLambda | 1.59e4 | 1.35e4 | | sumMono | 1.38e4 | 1.69e4 | | sumPoly | 1.99e4 | 1.42e4 | | sumSum | 1.34e4 | 1.36e4 | | sumTail | 2.03e4 | 1.37e4 | | sumTailLazy | 1.34e4 | 1.66e4 | :end: tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times | 1.65e4 | 1.45e4 | 1.96e1 | 1.45e4 | 1.46e4 | 1.45e4 | 1.45e4 | 1.99e1 | 1.46e4 | 1.48e4 | 1.46e4 | 1.63e4 | 2.02e1 | 1.45e4 | 1.44e4 | 1.61e4 | 1.46e4 | 2.78e1 | 2.10e4 | 1.49e4 | 3.82e4 | 2.55e4 | 2.04e1 | 1.45e4 | 2.59e4 | 2.01e4 | 1.64e4 | 2.01e1 | 1.33e4 | 1.33e4 | 1.34e4 | 1.33e4 | 2.06e1 | 1.41e4 | 1.33e4 | 1.33e4 | 1.34e4 | 2.01e1 | 1.37e4 | 1.34e4 | 2.18e4 | 1.68e4 | 3.22e1 | 1.54e4 | 1.45e4 | 1.93e3 | 2.19e3 | 1.98e1 | 1.83e3 | 3.15e3 | 2.02e3 | 2.27e3 | 2.39e1 | 2.38e3 | 2.15e3 | 3.70e3 | 3.70e3 | 2.00e1 | 3.70e3 | 6.11e3 | 2.65e3 | 2.29e3 | 2.73e1 | 2.44e3 | 2.60e3 | 1.47e4 | 1.99e4 | 1.99e1 | 1.36e4 | 1.34e4 | 1.35e4 | 1.36e4 | 2.04e1 | 1.37e4 | 1.34e4 | 1.35e4 | 1.73e4 | 1.97e1 | 1.35e4 | 1.51e4 | 1.34e4 | 1.35e4 | 1.99e1 | 1.34e4 | 2.01e4 | 1.34e4 | 1.34e4 | 1.99e1 | 1.34e4 | 2.01e4 | 1.33e4 | 1.41e4 | 1.99e1 | 1.35e4 | 1.33e4 |

*** short list #+begin_src sh :results drawer :exports both perf-explore -n 10000 -l 10 --best --ticks #+end_src

#+RESULTS: :results: | | stepTime | tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times | | sumAux | 1.81e2 | 2.33e2 | 1.81e2 | 1.82e2 | 1.32e1 | 1.83e2 | 1.82e2 | | sumCata | 1.81e2 | 2.33e2 | 1.81e2 | 1.82e2 | 1.33e1 | 2.33e2 | 1.81e2 | | sumCo | 2.10e2 | 2.33e2 | 2.13e2 | 2.11e2 | 2.49e1 | 2.33e2 | 2.10e2 | | sumCoCase | 1.81e2 | 2.47e2 | 1.82e2 | 1.82e2 | 1.83e1 | 2.47e2 | 1.81e2 | | sumCoGo | 1.81e2 | 2.34e2 | 2.34e2 | 2.34e2 | 1.33e1 | 2.33e2 | 1.82e2 | | sumF | 1.03e2 | 2.22e2 | 1.05e2 | 1.04e2 | 1.88e1 | 1.83e2 | 1.04e2 | | sumFlip | 1.13e2 | 1.28e2 | 1.20e2 | 1.20e2 | 1.88e1 | 1.17e2 | 1.63e2 | | sumFlipLazy | 1.13e2 | 2.07e2 | 1.76e2 | 1.20e2 | 2.49e1 | 2.04e2 | 1.01e2 | | sumFoldr | 1.81e2 | 2.66e2 | 2.65e2 | 1.81e2 | 1.88e1 | 2.65e2 | 1.81e2 | | sumFuse | 2.58e1 | 2.71e1 | 2.85e1 | 3.36e1 | 1.36e1 | 3.36e1 | 3.42e1 | | sumFuseFoldl' | 5.04e1 | 2.71e1 | 5.11e1 | 5.08e1 | 1.41e1 | 2.65e1 | 3.96e1 | | sumFuseFoldr | 8.16e1 | 8.61e1 | 6.26e1 | 7.34e1 | 1.25e1 | 8.47e1 | 8.54e1 | | sumFusePoly | 4.19e1 | 5.82e1 | 4.26e1 | 5.81e1 | 1.30e1 | 5.91e1 | 4.30e1 | | sumLambda | 1.39e2 | 1.30e2 | 1.60e2 | 1.63e2 | 1.33e1 | 1.29e2 | 1.43e2 | | sumMono | 8.04e1 | 1.43e2 | 8.64e1 | 8.41e1 | 1.83e1 | 1.15e2 | 8.16e1 | | sumPoly | 1.16e2 | 1.29e2 | 1.44e2 | 1.21e2 | 2.49e1 | 1.30e2 | 1.18e2 | | sumSum | 1.02e2 | 1.03e2 | 1.04e2 | 1.05e2 | 1.33e1 | 1.02e2 | 1.04e2 | | sumTail | 1.40e2 | 1.37e2 | 1.31e2 | 1.31e2 | 1.36e1 | 1.45e2 | 1.42e2 | | sumTailLazy | 1.06e2 | 1.78e2 | 1.12e2 | 1.12e2 | 1.84e1 | 2.25e2 | 1.09e2 | :end:

*** long list #+begin_src sh :results drawer :exports both perf-explore -n 100 -l 100000 --best --ticks #+end_src

#+RESULTS: :results: | | stepTime | tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times | | sumAux | 2.82e6 | 2.70e6 | 2.66e6 | 2.75e6 | 1.83e1 | 2.70e6 | 2.79e6 | | sumCata | 2.71e6 | 2.69e6 | 2.63e6 | 2.64e6 | 1.28e1 | 2.68e6 | 2.75e6 | | sumCo | 2.71e6 | 2.73e6 | 2.63e6 | 2.64e6 | 2.52e1 | 2.71e6 | 2.73e6 | | sumCoCase | 2.69e6 | 2.70e6 | 2.70e6 | 2.65e6 | 1.85e1 | 2.76e6 | 2.67e6 | | sumCoGo | 2.68e6 | 2.77e6 | 2.65e6 | 2.63e6 | 1.82e1 | 2.75e6 | 2.69e6 | | sumF | 1.24e6 | 1.24e6 | 1.24e6 | 1.24e6 | 1.26e1 | 1.24e6 | 1.27e6 | | sumFlip | 1.16e6 | 1.14e6 | 1.14e6 | 1.16e6 | 1.35e1 | 1.13e6 | 1.16e6 | | sumFlipLazy | 1.03e6 | 1.03e6 | 1.03e6 | 1.02e6 | 1.37e1 | 1.03e6 | 1.03e6 | | sumFoldr | 2.76e6 | 2.83e6 | 2.66e6 | 2.67e6 | 1.82e1 | 2.81e6 | 2.74e6 | | sumFuse | 1.45e5 | 1.45e5 | 1.45e5 | 1.44e5 | 1.39e1 | 1.45e5 | 1.45e5 | | sumFuseFoldl' | 1.45e5 | 1.45e5 | 1.45e5 | 1.46e5 | 1.35e1 | 1.45e5 | 1.45e5 | | sumFuseFoldr | 1.76e6 | 1.77e6 | 1.75e6 | 1.75e6 | 1.90e1 | 1.76e6 | 1.75e6 | | sumFusePoly | 1.45e5 | 2.00e5 | 1.45e5 | 1.45e5 | 1.32e1 | 1.68e5 | 1.45e5 | | sumLambda | 9.35e5 | 9.52e5 | 9.35e5 | 9.38e5 | 2.48e1 | 9.46e5 | 9.44e5 | | sumMono | 1.13e6 | 1.12e6 | 1.11e6 | 1.11e6 | 1.23e1 | 1.11e6 | 1.11e6 | | sumPoly | 1.03e6 | 1.04e6 | 1.02e6 | 1.03e6 | 1.32e1 | 1.03e6 | 1.03e6 | | sumSum | 1.20e6 | 1.22e6 | 1.20e6 | 1.19e6 | 1.32e1 | 1.22e6 | 1.20e6 | | sumTail | 1.22e6 | 1.22e6 | 1.22e6 | 1.22e6 | 1.28e1 | 1.22e6 | 1.22e6 | | sumTailLazy | 1.16e6 | 1.19e6 | 1.18e6 | 1.16e6 | 1.29e1 | 1.18e6 | 1.16e6 | :end:

** sums

#+begin_src sh :output drawer :exports both perf-explore -n 1000 -l 1000 --sums #+end_src

#+RESULTS: #+begin_example label1 label2 old result new result change

sumAux time 1.51e4 1.46e4 sumCata time 1.55e4 1.46e4 improvement sumCo time 1.55e4 2.01e4 degraded sumCoCase time 1.57e4 1.47e4 improvement sumCoGo time 1.54e4 1.69e4 slightly-degraded sumF time 1.33e4 2.02e4 degraded sumFlip time 1.41e4 1.36e4 sumFlipLazy time 1.47e4 1.93e4 degraded sumFoldr time 1.56e4 1.75e4 slightly-degraded sumFuse time 1.66e3 2.61e3 degraded sumFuseFoldl' time 1.71e3 2.70e3 degraded sumFuseFoldr time 5.83e3 6.72e3 slightly-degraded sumFusePoly time 2.12e3 2.64e3 degraded sumLambda time 1.47e4 1.39e4 improvement sumMono time 1.73e4 1.40e4 improvement sumPoly time 1.87e4 1.36e4 improvement sumSum time 1.33e4 1.34e4 sumTail time 1.47e4 1.40e4 improvement sumTailLazy time 1.44e4 1.34e4 improvement #+end_example

** lengths

#+begin_src sh :exports both perf-explore -n 1000 -l 1000 --lengths #+end_src

#+RESULTS: #+begin_example label1 label2 old result new result change

lengthAux time 1.45e4 1.60e4 slightly-degraded lengthCo time 1.64e4 1.39e4 improvement lengthCoCase time 1.78e4 1.31e4 improvement lengthF time 1.06e4 1.10e4 lengthFMono time 1.05e4 1.05e4 lengthFlip time 1.11e4 1.07e4 lengthFlipLazy time 1.13e4 1.07e4 improvement lengthFoldr time 1.27e4 1.33e4 slightly-degraded lengthFoldrConsttime 1.27e4 1.36e4 slightly-degraded lengthTail time 1.08e4 1.10e4 lengthTailLazy time 1.15e4 1.03e4 improvement #+end_example

** Gauge comparison

#+begin_src sh :results output :exports both perf-explore -n 1000 -l 1000 --average --gauge #+end_src

#+RESULTS: #+begin_example sumFuse benchmarking function ... function time 849.4 ns

benchmarking function ... function time 848.6 ns

sum benchmarking function ... function time 3.786 μs

benchmarking function ... function time 3.791 μs

lengthF benchmarking function ... function time 1.863 μs

benchmarking function ... function time 1.872 μs

constFuse benchmarking function ... function time 571.0 ns

benchmarking function ... function time 1.080 μs

mapInc benchmarking function ... function time 9.203 ns

benchmarking function ... function time 15.99 μs

noop benchmarking function ... function time 5.770 ns

benchmarking function ... function time 4.740 ns

#+end_example

** Space

Data is collected from GHCStats

allocated_bytes
gcs
gcdetails_live_bytes
max_live_bytes
max_mem_in_use_bytes

#+begin_src sh :results ouput :exports both perf-explore -n 10 -l 100000 --space +RTS -T -RTS #+end_src

#+RESULTS: : label1 label2 results : : sum MaxMem 4.61e6 : sum allocated 4.19e5 : sum gcLiveBytes 2.19e5 : sum gcollects 1.00e-1 : sum maxLiveBytes 0.00e0 : No golden file found. To create one, run with '-r'

Perf.BigO

Perf.BigO represents functionality to determine the complexity order for a computation.

We could do a regression and minimise the error term, but we know that the largest run contains the most information; we would need to weight the simulations according to some heuristic.

Instead, we:

estimate the order factor for each possible Order, from N3 to N0, setting the highest n run constant factor to zero,
pick the order based on lowest absolute error result summed across all the runs,

#+begin_src haskell-ng :results output :exports both import qualified Prelude as P import Data.List (nub) estOrder (\x -> sum $ nub [1..x]) 10 [1,10,100,1000] #+end_src

#+RESULTS: : BigOrder {bigOrder = N2, bigFactor = 13.375059, bigConstant = 0.0}

References

[[https://github.com/haskell-perf/checklist][The Haskell performance checklist]]

[[https://github.com/ndmitchell/spaceleak][ndmitchell/spaceleak: Notes on space leaks]]

** Core

[[https://ghc.gitlab.haskell.org/ghc/doc/users_guide/debugging.html#options-debugging][5.13. Debugging the compiler]]

#+begin_src sh ghc app/speed.hs -ddump-simpl -ddump-to-file -fforce-recomp -dlint -O #+end_src

[[https://wiki.haskell.org/Performance/GHC#Looking_at_the_Core][haskell wiki: Looking at the Core]]

[[https://godbolt.org/][godbolt]]

[[https://gitlab.haskell.org/ghc/ghc/-/issues/15185][ghc issue 15185: Enum instance for IntX / WordX are inefficient]]

[[https://fixpt.de/blog/2017-12-04-strictness-analysis-part-1.html][fixpt - All About Strictness Analysis (part 1)]]

** Profiling *** setup [[https://ghc.gitlab.haskell.org/ghc/doc/users_guide/profiling.html#prof-heap][8. Profiling]]

A typical configuration step for profiling:

#+begin_src sh :results output cabal configure --enable-library-profiling --enable-executable-profiling -fprof-auto -fprof -write-ghc-environment-files=always #+end_src

A cabal.project.local with profiling enabled:

#+begin_quote write-ghc-environment-files: always ignore-project: False flags: +prof +prof-auto library-profiling: True executable-profiling: True #+end_quote

Examples from markup-parse R&D:

Executable compilation:

#+begin_src sh :results output ghc -prof -fprof-auto -rtsopts app/speed0.hs -threaded -fforce-recomp #+end_src

Executable run:

#+begin_src sh :results output app/speed0 +RTS -s -p -hc -l -RTS #+end_src

*** Space usage output (-s)

#+begin_example 885,263,472 bytes allocated in the heap 8,507,448 bytes copied during GC 163,200 bytes maximum residency (4 sample(s)) 27,752 bytes maximum slop 6 MiB total memory in use (0 MiB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause

Gen 0 207 colls, 0 par 0.009s 0.010s 0.0001s 0.0002s Gen 1 4 colls, 0 par 0.001s 0.001s 0.0004s 0.0005s

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT time 0.006s ( 0.006s elapsed) MUT time 0.367s ( 0.360s elapsed) GC time 0.010s ( 0.011s elapsed) RP time 0.000s ( 0.000s elapsed) PROF time 0.000s ( 0.000s elapsed) EXIT time 0.001s ( 0.001s elapsed) Total time 0.384s ( 0.380s elapsed) #+end_example

*** Cost center profile (-p)

Dumped to speed0.prof

#+begin_example COST CENTRE MODULE SRC %time %alloc

token MarkupParse src/MarkupParse.hs:(259,1)-(260,20) 50.2 50.4 wrappedQ' MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:(215,1)-(217,78) 20.8 23.1 ws_ MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:(135,1)-(146,4) 14.3 5.5 eq MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:243:1-30 10.6 11.1 gather MarkupParse src/MarkupParse.hs:(420,1)-(428,100) 2.4 3.7 runParser FlatParse.Basic src/FlatParse/Basic.hs:(217,1)-(225,24) 1.0 6.0 #+end_example

*** heap analysis (-hc -l)

#+begin_src sh :results output eventlog2html speed0.eventlog #+end_src

Produces speed0.eventlog.html which contains heap charts.

** Cache speed

The average cycles per + operation can get down to about 0.7 cycles, and there are about 4 cache registers per cycle, so a sum pipeline uses 2.8 register instructions per +.

Cache	Cycles
register	4 per cycle
L1 Cache access	3-4 cycles
L2 Cache access	11-12 cycles
L3 unified access	30 - 40
DRAM hit	195 cycles
L1 miss	40 cycles
L2 miss	>600 cycles

perf
perf copied to clipboard

Metadata

← Metadata

Owner

Metadata

perf perf copied to clipboard

Metadata

← Metadata

Owner

Metadata

perf
perf copied to clipboard