Performance of Scan?
Recently I was benchmarking various versions of what essentially amounts to:
scan_add = scanl1 (+)
I wondered how the foldl does at this task and so ran benchmarking on the input [0..10^6]. I tried the following implementation:
foldl_scan = scan (postscan L.sum)
Criterion gives the following output:
benchmarking scans/scanl
time 10.87 ms (10.78 ms .. 11.00 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 10.93 ms (10.88 ms .. 11.03 ms)
std dev 182.5 μs (115.2 μs .. 336.6 μs)
benchmarking scans/foldl_scan
time 108.1 ms (102.5 ms .. 115.9 ms)
0.997 R² (0.992 R² .. 1.000 R²)
mean 112.1 ms (110.0 ms .. 114.5 ms)
std dev 3.523 ms (2.842 ms .. 4.492 ms)
variance introduced by outliers: 11% (moderately inflated)
Is performance like this on an example like this to be expected?
I should note that what I really want is to have:
runningTotal :: Num a => [a] -> ([a], a)
>>> runningTotal [1,2,3] = ([1,3,6], 6)
I don't know how to do this in one pass with the foldl library but doing it naively in haskell (as the foldl documentation suggests one shouldn't do):
dumb :: Num a => [a] -> (a, [a])
dumb xs = let adds = scanl1 (+) xs in (sum xs, adds)
gives extremely good performance on the same benchmark:
time 14.86 ms (14.77 ms .. 14.95 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 14.83 ms (14.79 ms .. 14.88 ms)
std dev 112.8 μs (90.04 μs .. 140.8 μs)
@Boarders: The thing that the foldl library is optimizing for is maximum memory residence since the library is built to compute things in constant space, so the real thing to compare would be heap usage for the two programs
@Gabriel439 : Thanks for the reply (this is also a very nice library btw!)! I did try running the examples from the documentation (i.e. the various sumAndLength functions) using an input list [1..10^6]. On my machine I get these numbers:
benchmarking folds/sumAndLength
time 6.251 ms (6.180 ms .. 6.369 ms)
0.997 R² (0.991 R² .. 0.999 R²)
mean 6.352 ms (6.304 ms .. 6.436 ms)
std dev 194.9 μs (117.1 μs .. 320.2 μs)
allocated: 0.434 R² (0.222 R² .. 0.735 R²)
iters 97.634 (67.423 .. 133.574)
y 2598.304 (1829.205 .. 3547.639)
variance introduced by outliers: 13% (moderately inflated)
benchmarking folds/sumAndLength'
time 113.6 ms (103.3 ms .. 127.7 ms)
0.989 R² (0.958 R² .. 0.999 R²)
mean 127.4 ms (121.5 ms .. 132.6 ms)
std dev 8.730 ms (6.654 ms .. 10.73 ms)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 8.123e7 (8.123e7 .. 8.123e7)
y 2531.429 (1840.000 .. 6804.103)
variance introduced by outliers: 12% (moderately inflated)
benchmarking folds/sumAndLength_Pair
time 3.337 ms (3.305 ms .. 3.378 ms)
0.997 R² (0.994 R² .. 0.999 R²)
mean 3.477 ms (3.435 ms .. 3.531 ms)
std dev 154.2 μs (116.6 μs .. 201.8 μs)
allocated: 0.362 R² (0.151 R² .. 0.595 R²)
iters 55.297 (33.753 .. 77.049)
y 2602.037 (1911.721 .. 3400.515)
variance introduced by outliers: 26% (moderately inflated)
benchmarking folds/sumAndLength_foldl
time 7.706 ms (7.215 ms .. 8.198 ms)
0.978 R² (0.958 R² .. 0.992 R²)
mean 7.414 ms (7.263 ms .. 7.717 ms)
std dev 598.4 μs (381.5 μs .. 960.4 μs)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 5.600e7 (5.600e7 .. 5.600e7)
y 2630.881 (1712.713 .. 3742.472)
variance introduced by outliers: 45% (moderately inflated)
With implementations:
sumAndLength :: Num a => [a] -> (a, Int)
sumAndLength xs = (sum xs, length xs)
sumAndLength' :: Num a => [a] -> (a, Int)
sumAndLength' xs = foldl' step (0, 0) xs
where
step (x, y) n = (x + n, y + 1)
data Pair a b = Pair !a !b
sumAndLength_Pair :: Num a => [a] -> (a, Int)
sumAndLength_Pair xs = done (foldl' step (Pair 0 0) xs)
where
step (Pair x y) n = Pair (x + n) (y + 1)
done (Pair x y) = (x, y)
sumAndLength_foldl = L.fold ((,) <$> L.sum <*> L.length)
Is this the correct measure for the library and if so are these numbers to be expected? I would be happy to make a pull request with these benchmarks added if they are relevant to the library.
EDIT: I got the numbers wrong but the point remains largely identical.
@Boarders: Yeah, that is the correct measurement. I think most of the difference in performance is due to L.fold internally using "foldl' implemented in terms of foldr", which sometimes has a worse constant factor than Data.List.foldl', but exposes more opportunities for fusion
I'd also welcome adding the benchmarks to the library
Running this on the list [1..10^7] and then on the list [1..10^8] gives the following:
[1..10^7]
benchmarking folds/sumAndLength
time 53.07 ms (52.21 ms .. 53.95 ms)
0.999 R² (0.996 R² .. 1.000 R²)
mean 56.26 ms (55.16 ms .. 58.80 ms)
std dev 2.952 ms (1.306 ms .. 4.894 ms)
allocated: 0.018 R² (0.000 R² .. 1.000 R²)
iters 64.923 (-278.199 .. 400.204)
y 3058.667 (562.886 .. 5332.815)
variance introduced by outliers: 15% (moderately inflated)
benchmarking folds/sumAndLength'
time 1.182 s (725.8 ms .. 1.551 s)
0.983 R² (0.938 R² .. 1.000 R²)
mean 1.135 s (1.059 s .. 1.212 s)
std dev 90.70 ms (70.05 ms .. 106.5 ms)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 8.123e8 (8.123e8 .. NaN)
y 1840.000 (-3496.000 .. 15080.000)
variance introduced by outliers: 21% (moderately inflated)
benchmarking folds/sumAndLength_Pair
time 29.67 ms (29.60 ms .. 29.77 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 29.72 ms (29.68 ms .. 29.78 ms)
std dev 99.76 μs (71.03 μs .. 147.6 μs)
allocated: 0.007 R² (0.000 R² .. 1.000 R²)
iters 25.333 (-134.912 .. 165.825)
y 2806.824 (1544.843 .. 4540.874)
benchmarking folds/sumAndLength_foldl
time 64.28 ms (64.08 ms .. 64.55 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 64.17 ms (64.06 ms .. 64.29 ms)
std dev 206.6 μs (146.3 μs .. 287.0 μs)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 5.600e8 (5.600e8 .. 5.600e8)
y 3388.800 (1840.000 .. 5939.765)
[1..10^8]
benchmarking folds/sumAndLength
time 549.6 ms (544.3 ms .. 556.4 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 545.0 ms (542.2 ms .. 546.6 ms)
std dev 2.744 ms (1.217 ms .. 3.832 ms)
allocated: 0.003 R² (0.001 R² .. 1.000 R²)
iters 18.400 (-752.000 .. 960.000)
y 2268.000 (984.000 .. 4408.000)
variance introduced by outliers: 19% (moderately inflated)
benchmarking folds/sumAndLength'
time 11.45 s (9.275 s .. 14.55 s)
0.992 R² (0.976 R² .. 1.000 R²)
mean 11.66 s (11.03 s .. 12.41 s)
std dev 765.8 ms (322.2 ms .. 1.060 s)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 8.123e9 (8.123e9 .. 8.123e9)
y 6752.000 (2696.000 .. 10808.000)
variance introduced by outliers: 19% (moderately inflated)
benchmarking folds/sumAndLength_Pair
time 313.5 ms (301.8 ms .. 330.1 ms)
0.999 R² (0.996 R² .. 1.000 R²)
mean 303.8 ms (299.2 ms .. 311.6 ms)
std dev 7.006 ms (3.634 ms .. 9.273 ms)
allocated: 0.535 R² (0.385 R² .. 1.000 R²)
iters 830.400 (56.000 .. 2234.000)
y 291.200 (-5420.000 .. 1840.000)
variance introduced by outliers: 16% (moderately inflated)
benchmarking folds/sumAndLength_foldl
time 659.3 ms (NaN s .. 704.5 ms)
0.999 R² (0.999 R² .. 1.000 R²)
mean 657.6 ms (649.2 ms .. 665.6 ms)
std dev 9.581 ms (6.488 ms .. 11.38 ms)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 5.600e9 (5.600e9 .. 5.600e9)
y 4740.000 (-536.000 .. 14104.000)
variance introduced by outliers: 19% (moderately inflated)
This makes it seem like the foldl library definitely does not compute things in constant memory.
The benchmark that I wrote for this is here. When I get some free time I will add something similar to the library. It does look though like there is a space leak in the code.
Your first example uses scan; there was a space leak in scan and scanM that I just fixed (See #137). Your other examples are folds, so I can't comment on those.