foldl Performance of Scan?

Recently I was benchmarking various versions of what essentially amounts to:

scan_add = scanl1 (+)

I wondered how the foldl does at this task and so ran benchmarking on the input [0..10^6]. I tried the following implementation:

foldl_scan = scan (postscan L.sum)

Criterion gives the following output:

benchmarking scans/scanl
time                 10.87 ms   (10.78 ms .. 11.00 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 10.93 ms   (10.88 ms .. 11.03 ms)
std dev              182.5 μs   (115.2 μs .. 336.6 μs)

benchmarking scans/foldl_scan
time                 108.1 ms   (102.5 ms .. 115.9 ms)
                     0.997 R²   (0.992 R² .. 1.000 R²)
mean                 112.1 ms   (110.0 ms .. 114.5 ms)
std dev              3.523 ms   (2.842 ms .. 4.492 ms)
variance introduced by outliers: 11% (moderately inflated)

Is performance like this on an example like this to be expected?

I should note that what I really want is to have:

runningTotal :: Num a => [a] -> ([a], a)

>>> runningTotal [1,2,3] = ([1,3,6], 6)

I don't know how to do this in one pass with the foldl library but doing it naively in haskell (as the foldl documentation suggests one shouldn't do):

dumb :: Num a => [a] -> (a, [a])
dumb xs = let adds = scanl1 (+) xs in (sum xs, adds)

gives extremely good performance on the same benchmark:

time                 14.86 ms   (14.77 ms .. 14.95 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 14.83 ms   (14.79 ms .. 14.88 ms)
std dev              112.8 μs   (90.04 μs .. 140.8 μs)

Jul 22 '19 14:07 Boarders

@Boarders: The thing that the foldl library is optimizing for is maximum memory residence since the library is built to compute things in constant space, so the real thing to compare would be heap usage for the two programs

Jul 25 '19 02:07 Gabriella439

@Gabriel439 : Thanks for the reply (this is also a very nice library btw!)! I did try running the examples from the documentation (i.e. the various sumAndLength functions) using an input list [1..10^6]. On my machine I get these numbers:

benchmarking folds/sumAndLength
time                 6.251 ms   (6.180 ms .. 6.369 ms)
                     0.997 R²   (0.991 R² .. 0.999 R²)
mean                 6.352 ms   (6.304 ms .. 6.436 ms)
std dev              194.9 μs   (117.1 μs .. 320.2 μs)
allocated:           0.434 R²   (0.222 R² .. 0.735 R²)
  iters              97.634     (67.423 .. 133.574)
  y                  2598.304   (1829.205 .. 3547.639)
variance introduced by outliers: 13% (moderately inflated)

benchmarking folds/sumAndLength'
time                 113.6 ms   (103.3 ms .. 127.7 ms)
                     0.989 R²   (0.958 R² .. 0.999 R²)
mean                 127.4 ms   (121.5 ms .. 132.6 ms)
std dev              8.730 ms   (6.654 ms .. 10.73 ms)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              8.123e7    (8.123e7 .. 8.123e7)
  y                  2531.429   (1840.000 .. 6804.103)
variance introduced by outliers: 12% (moderately inflated)

benchmarking folds/sumAndLength_Pair
time                 3.337 ms   (3.305 ms .. 3.378 ms)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 3.477 ms   (3.435 ms .. 3.531 ms)
std dev              154.2 μs   (116.6 μs .. 201.8 μs)
allocated:           0.362 R²   (0.151 R² .. 0.595 R²)
  iters              55.297     (33.753 .. 77.049)
  y                  2602.037   (1911.721 .. 3400.515)
variance introduced by outliers: 26% (moderately inflated)

benchmarking folds/sumAndLength_foldl
time                 7.706 ms   (7.215 ms .. 8.198 ms)
                     0.978 R²   (0.958 R² .. 0.992 R²)
mean                 7.414 ms   (7.263 ms .. 7.717 ms)
std dev              598.4 μs   (381.5 μs .. 960.4 μs)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              5.600e7    (5.600e7 .. 5.600e7)
  y                  2630.881   (1712.713 .. 3742.472)
variance introduced by outliers: 45% (moderately inflated)

With implementations:

sumAndLength :: Num a => [a] -> (a, Int)
sumAndLength xs = (sum xs, length xs)

sumAndLength' :: Num a => [a] -> (a, Int)
sumAndLength' xs = foldl' step (0, 0) xs
  where
    step (x, y) n = (x + n, y + 1)

data Pair a b = Pair !a !b

sumAndLength_Pair :: Num a => [a] -> (a, Int)
sumAndLength_Pair xs = done (foldl' step (Pair 0 0) xs)
  where
    step (Pair x y) n = Pair (x + n) (y + 1)

    done (Pair x y) = (x, y)


sumAndLength_foldl = L.fold ((,) <$> L.sum <*> L.length)

Is this the correct measure for the library and if so are these numbers to be expected? I would be happy to make a pull request with these benchmarks added if they are relevant to the library.

EDIT: I got the numbers wrong but the point remains largely identical.

Jul 26 '19 19:07 Boarders

@Boarders: Yeah, that is the correct measurement. I think most of the difference in performance is due to L.fold internally using "foldl' implemented in terms of foldr", which sometimes has a worse constant factor than Data.List.foldl', but exposes more opportunities for fusion

I'd also welcome adding the benchmarks to the library

Jul 27 '19 01:07 Gabriella439

Running this on the list [1..10^7] and then on the list [1..10^8] gives the following:

[1..10^7]

benchmarking folds/sumAndLength
time                 53.07 ms   (52.21 ms .. 53.95 ms)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 56.26 ms   (55.16 ms .. 58.80 ms)
std dev              2.952 ms   (1.306 ms .. 4.894 ms)
allocated:           0.018 R²   (0.000 R² .. 1.000 R²)
  iters              64.923     (-278.199 .. 400.204)
  y                  3058.667   (562.886 .. 5332.815)
variance introduced by outliers: 15% (moderately inflated)

benchmarking folds/sumAndLength'
time                 1.182 s    (725.8 ms .. 1.551 s)
                     0.983 R²   (0.938 R² .. 1.000 R²)
mean                 1.135 s    (1.059 s .. 1.212 s)
std dev              90.70 ms   (70.05 ms .. 106.5 ms)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              8.123e8    (8.123e8 .. NaN)
  y                  1840.000   (-3496.000 .. 15080.000)
variance introduced by outliers: 21% (moderately inflated)

benchmarking folds/sumAndLength_Pair
time                 29.67 ms   (29.60 ms .. 29.77 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 29.72 ms   (29.68 ms .. 29.78 ms)
std dev              99.76 μs   (71.03 μs .. 147.6 μs)
allocated:           0.007 R²   (0.000 R² .. 1.000 R²)
  iters              25.333     (-134.912 .. 165.825)
  y                  2806.824   (1544.843 .. 4540.874)

benchmarking folds/sumAndLength_foldl
time                 64.28 ms   (64.08 ms .. 64.55 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 64.17 ms   (64.06 ms .. 64.29 ms)
std dev              206.6 μs   (146.3 μs .. 287.0 μs)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              5.600e8    (5.600e8 .. 5.600e8)
  y                  3388.800   (1840.000 .. 5939.765)

[1..10^8]

benchmarking folds/sumAndLength
time                 549.6 ms   (544.3 ms .. 556.4 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 545.0 ms   (542.2 ms .. 546.6 ms)
std dev              2.744 ms   (1.217 ms .. 3.832 ms)
allocated:           0.003 R²   (0.001 R² .. 1.000 R²)
  iters              18.400     (-752.000 .. 960.000)
  y                  2268.000   (984.000 .. 4408.000)
variance introduced by outliers: 19% (moderately inflated)

benchmarking folds/sumAndLength'
time                 11.45 s    (9.275 s .. 14.55 s)
                     0.992 R²   (0.976 R² .. 1.000 R²)
mean                 11.66 s    (11.03 s .. 12.41 s)
std dev              765.8 ms   (322.2 ms .. 1.060 s)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              8.123e9    (8.123e9 .. 8.123e9)
  y                  6752.000   (2696.000 .. 10808.000)
variance introduced by outliers: 19% (moderately inflated)

benchmarking folds/sumAndLength_Pair
time                 313.5 ms   (301.8 ms .. 330.1 ms)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 303.8 ms   (299.2 ms .. 311.6 ms)
std dev              7.006 ms   (3.634 ms .. 9.273 ms)
allocated:           0.535 R²   (0.385 R² .. 1.000 R²)
  iters              830.400    (56.000 .. 2234.000)
  y                  291.200    (-5420.000 .. 1840.000)
variance introduced by outliers: 16% (moderately inflated)

benchmarking folds/sumAndLength_foldl
time                 659.3 ms   (NaN s .. 704.5 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 657.6 ms   (649.2 ms .. 665.6 ms)
std dev              9.581 ms   (6.488 ms .. 11.38 ms)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              5.600e9    (5.600e9 .. 5.600e9)
  y                  4740.000   (-536.000 .. 14104.000)
variance introduced by outliers: 19% (moderately inflated)

This makes it seem like the foldl library definitely does not compute things in constant memory.

Jul 27 '19 22:07 Boarders

The benchmark that I wrote for this is here. When I get some free time I will add something similar to the library. It does look though like there is a space leak in the code.

Jul 30 '19 20:07 Boarders

Your first example uses scan; there was a space leak in scan and scanM that I just fixed (See #137). Your other examples are folds, so I can't comment on those.

Oct 13 '20 04:10 agrue