haddock
Add `-rtsopts` to the build flags
haddock -rtsopts
Tuning the GC options can yield big performance improvements when compiling with GHC. Since Haddock is doing a compilation, it stands to reason that more resources can similarly improve its performance.
The following runs are on the work codebase.
Without any options
This run used no options aside from `+RTS -s -RTS` to dump output:
1,604,466,578,392 bytes allocated in the heap
386,407,308,656 bytes copied during GC
12,648,347,528 bytes maximum residency (79 sample(s))
59,820,152 bytes maximum slop
36014 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 373677 colls, 0 par 169.848s 170.071s 0.0005s 0.0061s
Gen 1 79 colls, 0 par 138.591s 138.638s 1.7549s 11.6650s
TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.000s elapsed)
MUT time 551.120s (619.349s elapsed)
GC time 308.439s (308.709s elapsed)
EXIT time 0.004s ( 0.003s elapsed)
Total time 859.564s (928.060s elapsed)
Alloc rate 2,911,282,964 bytes per MUT second
Productivity 64.1% of total user, 66.7% of total elapsed
`time` reports a runtime of 15:30.
Without rtsopts, with `-j20`
My laptop has 20 cores, so running with `-j20` gives a decent boost:
1,603,850,165,040 bytes allocated in the heap
400,668,426,664 bytes copied during GC
12,212,396,328 bytes maximum residency (76 sample(s))
58,748,632 bytes maximum slop
34914 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 218665 colls, 205375 par 377.040s 169.684s 0.0008s 0.0236s
Gen 1 76 colls, 66 par 604.065s 78.588s 1.0341s 6.0503s
Parallel GC work balance: 49.53% (serial 0%, perfect 100%)
TASKS: 74 (1 bound, 73 peak workers (73 total), using -N20)
SPARKS: 6814 (952 converted, 0 overflowed, 0 dud, 1609 GC'd, 4253 fizzled)
INIT time 0.001s ( 0.000s elapsed)
MUT time 1163.585s (336.423s elapsed)
GC time 981.105s (248.272s elapsed)
EXIT time 1.219s ( 0.005s elapsed)
Total time 2145.909s (584.701s elapsed)
Alloc rate 1,378,369,787 bytes per MUT second
Productivity 54.2% of total user, 57.5% of total elapsed
Unfortunately, we run afoul of the garbage collector: productivity drops from 64.1% to 54.2% of total user time. We're still considerably faster, though. `time` shows a 9:41 runtime, a saving of 5:49, or a 37.5% improvement!
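As a quick check on the productivity figures, the reported percentages are consistent with mutator time divided by total time. The sketch below is an illustration rather than GHC's actual implementation: it assumes that simple ratio, and the `productivity` helper and `runs` table are hypothetical names, with the timings copied from the two runs above.

```python
# Productivity as it appears in `+RTS -s` output, assuming it is simply
# mutator (MUT) time divided by total time. Timings are in seconds and are
# copied from the "no options" and "-j20" runs above.

def productivity(mut_seconds: float, total_seconds: float) -> float:
    """Return the share of time spent in the mutator, as a percentage."""
    return 100.0 * mut_seconds / total_seconds

# (MUT time, total time) for the user clock and the elapsed clock.
runs = {
    "no options": {"user": (551.120, 859.564), "elapsed": (619.349, 928.060)},
    "-j20": {"user": (1163.585, 2145.909), "elapsed": (336.423, 584.701)},
}

for name, clocks in runs.items():
    user = productivity(*clocks["user"])
    elapsed = productivity(*clocks["elapsed"])
    print(f"{name}: {user:.1f}% of total user, {elapsed:.1f}% of total elapsed")
```

Under that assumption, the `-j20` run spends a larger share of its CPU time in the collector even though its wall-clock time is much lower.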
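With rtsopts
This invocation used `+RTS -s -n2m -A128m -qg -RTS`. Here `-A128m` raises the allocation area to 128 MiB, `-n2m` splits it into 2 MiB chunks, and `-qg` turns off the parallel garbage collector. I did not tune this in any way; these are just the first things I try when I'm playing with RTS options for performance.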
1,602,827,917,096 bytes allocated in the heap
212,897,340,048 bytes copied during GC
12,731,140,720 bytes maximum residency (29 sample(s))
60,183,952 bytes maximum slop
36156 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 11654 colls, 0 par 127.812s 127.867s 0.0110s 0.1192s
Gen 1 29 colls, 0 par 86.436s 86.458s 2.9813s 11.5267s
TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.002s ( 0.001s elapsed)
MUT time 608.524s (677.754s elapsed)
GC time 214.247s (214.326s elapsed)
EXIT time 0.004s ( 0.010s elapsed)
Total time 822.777s (892.090s elapsed)
Alloc rate 2,633,959,982 bytes per MUT second
Productivity 74.0% of total user, 76.0% of total elapsed
`time` reports 14:54 this time: a modest 36 seconds saved, only a 3.9% improvement. Still, free speed is free speed.
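To see where the GC savings come from, one can divide the bytes allocated by the number of gen 0 collections in each single-threaded run. This is a back-of-the-envelope sketch, assuming a minor collection runs roughly each time the allocation area fills; the helper name `bytes_per_minor_gc` is hypothetical, and the inputs are copied from the stats above.

```python
# Average allocation between minor (gen 0) collections, computed from the
# `+RTS -s` figures of the two single-threaded runs above.

MIB = 1024 * 1024

def bytes_per_minor_gc(bytes_allocated: int, gen0_collections: int) -> float:
    """Average number of bytes allocated per gen 0 collection."""
    return bytes_allocated / gen0_collections

# Run without any options: 373,677 gen 0 collections.
default_run = bytes_per_minor_gc(1_604_466_578_392, 373_677)
# Run with the larger allocation area: 11,654 gen 0 collections.
tuned_run = bytes_per_minor_gc(1_602_827_917_096, 11_654)

print(f"no options:  {default_run / MIB:.1f} MiB per gen 0 collection")
print(f"RTS options: {tuned_run / MIB:.1f} MiB per gen 0 collection")
print(f"ratio of gen 0 collections: {373_677 / 11_654:.1f}x")
```

The untuned run collects about every 4 MiB of allocation, while the tuned run collects about every 131 MiB, close to the 128 MiB allocation area. That works out to roughly 32 times fewer gen 0 collections for nearly the same total allocation.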
With rtsopts, with `-j20`
Let's add `-j20` as well, so it runs on all my cores.
1,601,853,408,352 bytes allocated in the heap
130,407,095,128 bytes copied during GC
12,556,216,824 bytes maximum residency (24 sample(s))
59,799,048 bytes maximum slop
37863 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1196 colls, 0 par 1.665s 89.695s 0.0750s 0.6673s
Gen 1 24 colls, 0 par 10.674s 54.364s 2.2652s 11.3638s
TASKS: 78 (1 bound, 77 peak workers (77 total), using -N20)
SPARKS: 6814 (254 converted, 0 overflowed, 0 dud, 13 GC'd, 6547 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 983.777s (282.517s elapsed)
GC time 12.339s (144.059s elapsed)
EXIT time 1.102s ( 0.003s elapsed)
Total time 997.219s (426.580s elapsed)
Alloc rate 1,628,268,955 bytes per MUT second
Productivity 98.7% of total user, 66.2% of total elapsed
`time` reports 7:09.
This is a 2:32 improvement over the prior parallel run, another 26% improvement.
Overall, going from 15:30 to 7:09 saves 8:21, a 53.9% improvement over the untuned run, just from tuning the runtime parameters a bit.
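For anyone who wants to recompute the savings, here is a small sketch that derives them from the four `time` readings above. The helper names `to_seconds` and `improvement` are hypothetical; the only inputs are the wall-clock times quoted in this description.

```python
# Wall-clock savings between runs, from the `time` readings quoted above.

def to_seconds(clock: str) -> int:
    """Convert an 'mm:ss' wall-clock string to whole seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)

def improvement(before: str, after: str) -> tuple:
    """Return (time saved as 'm:ss', percentage reduction in wall-clock time)."""
    saved = to_seconds(before) - to_seconds(after)
    percent = 100.0 * saved / to_seconds(before)
    return f"{saved // 60}:{saved % 60:02d}", percent

baseline = "15:30"  # run without any options
runs = {
    "-j20": "9:41",
    "RTS options": "14:54",
    "RTS options and -j20": "7:09",
}

for name, clock in runs.items():
    saved, percent = improvement(baseline, clock)
    print(f"{name}: {saved} saved, {percent:.1f}% less wall-clock time than baseline")

# Effect of the RTS options on top of the parallel run.
saved, percent = improvement("9:41", "7:09")
print(f"RTS options on top of -j20: {saved} saved, {percent:.1f}%")
```

Comparing against the same baseline keeps the percentages comparable; the last line instead measures the tuned parallel run against the untuned parallel run.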