Pairwise summation
Motivation
The naive summation algorithm can cause significant precision loss when summing many floating point numbers - pairwise summation mitigates the issue without adding significant computational overhead.
Ideally, this would only be used when dealing with floats, but unfortunately we can't specialize on the argument type. Mitigating precision loss in floating point sums, on the other hand, is of paramount importance for the many algorithms that rely on summation as a building block - given that the overhead is minimal, I'd argue that the precision benefit is sufficient to adopt pairwise summation as the standard summation method.
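To make the idea concrete, here is a minimal sketch of the scheme (illustrative code, not the PR's implementation; the threshold value is arbitrary). A naive loop accumulates rounding error proportional to n, while the recursive split keeps it proportional to log n:

fn pairwise_sum(xs: &[f64]) -> f64 {
    const THRESHOLD: usize = 64; // illustrative base-case size
    if xs.len() <= THRESHOLD {
        // naive summation is fine (and fast) for short slices
        xs.iter().sum()
    } else {
        // split in half, sum each half recursively, add the partial sums
        let (lo, hi) = xs.split_at(xs.len() / 2);
        pairwise_sum(lo) + pairwise_sum(hi)
    }
}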
Public API
Nothing changes.
Technicalities
I haven't modified the overall structure of sum and sum_axis: I have just provided some helper functions that isolate the pairwise summation algorithm.
The function that sums slices leverages the available unrolled_fold to vectorise the sum once the slice length falls below the size threshold.
The function that sums iterators falls back on the slice summation function as soon as the first pass is over, to recover vectorisation given that the array of partial sums is contiguous in memory (only relevant for arrays with more than 512^2 elements).
The function that sums an iterator of arrays is self-recursive.
It would probably be possible to share the pagination logic (to sum in blocks of 512) between the two functions operating on iterators, but I haven't been able to do it so far.
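As a rough sketch of the structure just described (illustrative, assuming f64 elements and the 512-element blocks mentioned above; the real helpers are generic): the iterator is consumed in fixed-size blocks, each block is summed naively, and the contiguous vector of partial sums is then handed to the slice-based pairwise function, which is where vectorisation is recovered.

fn iterator_pairwise_sum<I>(iter: I) -> f64
where
    I: Iterator<Item = f64>,
{
    const BLOCK: usize = 512;
    let mut partial_sums = Vec::new();
    let mut block_sum = 0.0;
    let mut count = 0;
    for x in iter {
        block_sum += x;
        count += 1;
        if count == BLOCK {
            partial_sums.push(block_sum);
            block_sum = 0.0;
            count = 0;
        }
    }
    if count > 0 {
        partial_sums.push(block_sum); // leftover partial block
    }
    // contiguous in memory, so this second pass vectorises again;
    // pairwise_sum here is a slice-based helper like the sketch above
    pairwise_sum(&partial_sums)
}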
Nice. It's not obvious to me whether we should use the current implementation, this one, or another compensated summation algorithm in the default sums.
1. We can specialize on the argument type - we only need to take the step of requiring 'static on the scalar type to be able to use type id / "static" / "simple and stupid" specialization, which is what for example matrix multiplication does. I think we should take this step. (In future Rust we hope that type id based specialization is usable without this bound.)
2. This is a scalable solution. Unrolled fold is already using a split into 8 subfactors, which should improve precision, but it's only used when we have contiguous input of course.
3. Not complete without benchmarks of contiguous and other cases.
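For reference, point 1 boils down to a runtime type-id check that monomorphization turns into a compile-time constant. A minimal sketch (the helper and function names here are illustrative, not ndarray's actual API):

use std::any::TypeId;
use num_traits::Zero;

// Requires 'static so that TypeId is available for the scalar type.
fn same_type<A: 'static, B: 'static>() -> bool {
    TypeId::of::<A>() == TypeId::of::<B>()
}

fn sum_maybe_specialized<A>(xs: &[A]) -> A
where
    A: 'static + Clone + Zero,
{
    if same_type::<A, f64>() {
        // dispatch to an f64-specific (e.g. pairwise/SIMD) path here;
        // in practice this needs an unsafe cast from &[A] to &[f64]
    }
    // generic fallback
    xs.iter().cloned().fold(A::zero(), |acc, x| acc + x)
}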
The only other compensated summation algorithm that comes to mind is Kahan summation, but its performance overhead is not negligible, so I wouldn't provide it as the default implementation.
Alternatively, we could provide different sum methods and let the user choose what they need: this is what Julia does, using pairwise summation in its default sum method (here) while providing a variant of Kahan summation as an additional method - here. NumPy instead opted to provide pairwise summation as the default implementation without an additional method for Kahan summation - here.
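For context, a textbook sketch of Kahan's algorithm, which shows where the overhead comes from - roughly four dependent floating-point operations per element instead of one, which also prevents the loop from being vectorised:

fn kahan_sum(xs: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut c = 0.0; // running compensation for lost low-order bits
    for &x in xs {
        let y = x - c;     // apply the compensation
        let t = sum + y;   // low-order bits of y may be lost here
        c = (t - sum) - y; // reconstruct what was lost
        sum = t;
    }
    sum
}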
Re 1. - Wouldn't using TypeId + 'static enable this implementation only for the float primitives, f32 and f64? There are structs that implement the Float trait while encapsulating a float primitive to provide additional invariants (e.g. N64/N32 and R64/R32 in noisy_float - docs) that would not be specialized, unless we can specialize on types that implement the Float trait, which as far as I understand is not currently possible.
Alternatively, we could have a list of "common" float wrappers and specialize for all of them, but it doesn't sound very scalable.
Re 3. - Ok, I'll work on benchmarks and add them to the PR.
Running these benchmarks I get:
name master ns/iter branch ns/iter diff ns/iter diff % speedup
contiguous_sum_1e2 16 16 0 0.00% x 1.00
contiguous_sum_1e4 1,223 1,415 192 15.70% x 0.86
contiguous_sum_1e7 3,956,575 4,089,350 132,775 3.36% x 0.97
contiguous_sum_int_1e4 1,568 1,938 370 23.60% x 0.81
sum_by_col_1e4 297,202 290,586 -6,616 -2.23% x 1.02
sum_by_row_1e4 431,325 425,512 -5,813 -1.35% x 1.01
Never done serious benchmarking before, so take these results with a grain of salt.
@LukeMathWalker Thanks for working on this! Pairwise summation is better than the current implementation.
I would hope we'd be able to use pairwise summation without significant performance cost. I'm surprised that the differences in some of those benchmarks are as large as they are.
A few thoughts:
- I think we should use pairwise summation as the implementation of .sum(). It's the default for NumPy and Julia, and it seems like a clear win if the performance cost is small.
- We can specialize on integer types to avoid pairwise summation for them if that turns out to provide a significant performance benefit. However, the general case should be pairwise, because as @LukeMathWalker pointed out, there are many types other than f32 and f64 that need the pairwise behavior for accuracy.
- It's unsatisfying to add an array_pairwise_sum function just for sum_axis. How about something like the example below instead? I would hope the performance would be similar.

  pub fn sum_axis(&self, axis: Axis) -> Array<A, D::Smaller>
  where
      A: Clone + Zero + Add<Output = A>,
      D: RemoveAxis,
  {
      let mut out = Array::zeros(self.dim.remove_axis(axis));
      Zip::from(&mut out)
          .and(self.lanes(axis))
          .apply(|out, lane| *out = lane.sum());
      out
  }

- We should make sure that pairwise_sum and iterator_pairwise_sum add up the same number of values in the base case. In the current version of this PR, they don't, since pairwise_sum calls unrolled_fold, which performs an eightfold unrolled sum. To make sure pairwise_sum and iterator_pairwise_sum add up the same number of values in the base case, iterator_pairwise_sum needs to divide NAIVE_SUM_THRESHOLD by 8.
- It would be nice for .product() to be pairwise too. We could add a .pairwise_reduce() method for ArrayBase that would assume that the operation f is associative and operate in a pairwise way, and then use that method to implement both .sum() and .product(). Assuming the compiler inlines sufficiently aggressively, this shouldn't have a performance cost. Something like the signature below (a fuller sketch of the body follows this list):

  impl<A, S, D> ArrayBase<S, D>
  where
      S: RawData<Elem = A>,
      D: Dimension,
  {
      fn pairwise_reduce<F>(&self, base: usize, id: A, f: F) -> A
      where
          F: Fn(A, A) -> A,
          A: Clone,
      {
          unimplemented!() // signature sketch only
      }
  }
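For concreteness, here is one way the pairwise_reduce body could look, written as a free function over a slice (illustrative only; the real method would also have to handle discontiguous arrays, and it assumes base >= 1 and an associative f):

fn pairwise_reduce<A, F>(xs: &[A], base: usize, id: A, f: &F) -> A
where
    A: Clone,
    F: Fn(A, A) -> A,
{
    if xs.len() <= base {
        // base case: naive fold over a short slice
        xs.iter().fold(id, |acc, x| f(acc, x.clone()))
    } else {
        // split in half and combine the two reductions with f
        let (lo, hi) = xs.split_at(xs.len() / 2);
        f(
            pairwise_reduce(lo, base, id.clone(), f),
            pairwise_reduce(hi, base, id, f),
        )
    }
}

With something like this, .sum() becomes pairwise_reduce(..., A::zero(), &|a, b| a + b) and .product() becomes pairwise_reduce(..., A::one(), &|a, b| a * b).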
We can specialize on integer types to avoid pairwise summation for them if that turns out to provide a significant performance benefit. However, the general case should be pairwise, because as @LukeMathWalker pointed out, there are many types other than f32 and f64 that need the pairwise behavior for accuracy.
We will still fall short on new types wrapping integer values, but this seems like a better solution. If we plan on specializing, we might also take advantage of the fact that, depending on the byte size, we might get more integers packed together using SIMD compared to floats. But I guess this falls under the broader discussion on how to leverage SIMD instructions in ndarray.
It's unsatisfying to add an array_pairwise_sum function just for sum_axis. How about something like the example below instead? I would hope the performance would be similar.
I tried it out and this is the performance I get:
name master ns/iter branch ns/iter diff ns/iter diff % speedup
sum_by_col_1e4 287,381 303,410 16,029 5.58% x 0.95
sum_by_row_1e4 420,978 1,019,645 598,667 142.21% x 0.41
We should make sure that pairwise_sum and iterator_pairwise_sum add up the same number of values in the base case. In the current version of this PR, they don't, since pairwise_sum calls unrolled_fold which performs an eightfold unrolled sum. To make sure pairwise_sum and iterator_pairwise_sum add up the same number of values in the base case, iterator_pairwise_sum needs to divide NAIVE_SUM_THRESHOLD by 8.
Good catch - I'll fix it.
It would be nice for .product() to be pairwise too.
Would it really be beneficial? When doing long products what I usually fear is overflowing/underflowing. To get around that, I use the exp-ln trick, as in https://github.com/jturner314/ndarray-stats/pull/20 for geometric_mean.
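For the record, the exp-ln trick maps the product to a sum of logarithms and back, so the intermediate values stay in a manageable range. A sketch, assuming strictly positive inputs:

fn stable_product(xs: &[f64]) -> f64 {
    // ln turns the product into a sum, which is far less prone to
    // overflow/underflow; exp maps the result back. Note that the
    // empty product correctly comes out as exp(0) = 1.
    xs.iter().map(|x| x.ln()).sum::<f64>().exp()
}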
I took another look at .sum_axis() and decided to benchmark this time instead of just guessing. :) This has slightly better performance on my machine than the PR for the sum_by_col_1e4 and sum_by_row_1e4 benchmarks and is more concise:
pub fn sum_axis(&self, axis: Axis) -> Array<A, D::Smaller>
where
    A: Clone + Zero + Add<Output = A>,
    D: RemoveAxis,
{
    let n = self.len_of(axis);
    let stride = self.strides()[axis.index()];
    if self.ndim() == 2 && stride == 1 {
        // contiguous along the axis we are summing
        let mut res = Array::zeros(self.raw_dim().remove_axis(axis));
        let ax = axis.index();
        for (i, elt) in enumerate(&mut res) {
            *elt = self.index_axis(Axis(1 - ax), i).sum();
        }
        res
    } else if n <= numeric_util::NO_SIMD_NAIVE_SUM_THRESHOLD {
        // short axis: a naive fold is accurate enough
        self.fold_axis(axis, A::zero(), |acc, x| acc.clone() + x.clone())
    } else {
        // long axis: split along it and combine the two halves pairwise
        let (v1, v2) = self.view().split_at(axis, n / 2);
        v1.sum_axis(axis) + v2.sum_axis(axis)
    }
}
To make sure pairwise_sum and iterator_pairwise_sum add up the same number of values in the base case, iterator_pairwise_sum needs to divide NAIVE_SUM_THRESHOLD by 8.
Actually, there's another issue I didn't catch earlier. In the base case, pairwise_sum adds 64 values, and then for higher levels, adds pairs of sums. In contrast, in the base case, iterator_pairwise_sum adds 64 values (computing partial_sums), then in the next higher level adds groups of 64 sums (in the pairwise_sum call), and then for the higher levels adds pairs of sums. In other words, iterator_pairwise_sum has two levels of summing groups of 64 values, not just the base level, so it needs to be changed to match pairwise_sum.
I also noticed that we technically need to implement unrolled_fold pairwise with something like this:
pub fn unrolled_fold<A, I, F>(mut xs: &[A], init: I, f: F) -> A
where
    A: Clone,
    I: Fn() -> A,
    F: Fn(A, A) -> A,
{
    // eightfold unrolled so that floating point can be vectorized
    // (even with strict floating point accuracy semantics)
    let (mut p0, mut p1, mut p2, mut p3, mut p4, mut p5, mut p6, mut p7) =
        (init(), init(), init(), init(), init(), init(), init(), init());
    while xs.len() >= 8 {
        p0 = f(p0, xs[0].clone());
        p1 = f(p1, xs[1].clone());
        p2 = f(p2, xs[2].clone());
        p3 = f(p3, xs[3].clone());
        p4 = f(p4, xs[4].clone());
        p5 = f(p5, xs[5].clone());
        p6 = f(p6, xs[6].clone());
        p7 = f(p7, xs[7].clone());
        xs = &xs[8..];
    }
    // combine the eight accumulators as a tree instead of a chain,
    // so the reduction of the accumulators is itself pairwise
    let (q0, q1, q2, q3) = (f(p0, p4), f(p1, p5), f(p2, p6), f(p3, p7));
    let (r0, r1) = (f(q0, q2), f(q1, q3));
    let unrolled = f(r0, r1);
    // make it clear to the optimizer that this loop is short
    // and can not be autovectorized.
    let mut partial = init();
    for i in 0..xs.len() {
        if i >= 7 {
            break;
        }
        partial = f(partial.clone(), xs[i].clone())
    }
    f(unrolled, partial)
}
The performance difference is within the +/- bounds of the benchmarks.
By the way, after more thought, I'd prefer to rename NO_SIMD_NAIVE_SUM_THRESHOLD to NAIVE_SUM_THRESHOLD because that's really the base case, and add a constant UNROLL_ACCUMULATORS or UNROLL_DEGREE of 8. Then, the condition for pairwise_sum would be n <= NAIVE_SUM_THRESHOLD * UNROLL_ACCUMULATORS.
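Concretely, the renamed constants and the new condition would read something like this (the threshold value is illustrative, and unrolled_fold is the function shown above):

const NAIVE_SUM_THRESHOLD: usize = 64; // base-case naive-sum length
const UNROLL_ACCUMULATORS: usize = 8; // accumulators used by unrolled_fold

fn pairwise_sum(xs: &[f64]) -> f64 {
    if xs.len() <= NAIVE_SUM_THRESHOLD * UNROLL_ACCUMULATORS {
        // base case: a single unrolled_fold pass, eight accumulators wide
        unrolled_fold(xs, || 0.0, |a, b| a + b)
    } else {
        let (lo, hi) = xs.split_at(xs.len() / 2);
        pairwise_sum(lo) + pairwise_sum(hi)
    }
}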
It would be nice for .product() to be pairwise too.
Would it really be beneficial?
I assumed that floating-point multiplication would have the same issue with error accumulation as addition, but I don't really know. We can keep .product() as-is for now.
Ok, I'll make the required changes to have a uniform behaviour.
One observation: out of curiosity, I have repeated the benchmarks on a different machine (my desktop has an AMD Ryzen 1700, while this time I am running them on my office laptop, which has an Intel i9-8950HK).
Laptop (Intel):
name master ns/iter branch ns/iter diff ns/iter diff % speedup
contiguous_sum_1e2 16 16 0 0.00% x 1.00
contiguous_sum_1e4 1,355 1,435 80 5.90% x 0.94
contiguous_sum_1e7 3,433,483 3,489,613 56,130 1.63% x 0.98
contiguous_sum_int_1e4 2,338 2,294 -44 -1.88% x 1.02
sum_by_col_1e4 213,243 204,285 -8,958 -4.20% x 1.04
sum_by_row_1e4 326,246 326,854 608 0.19% x 1.00
Desktop (AMD): [for ease of comparison]
name master ns/iter branch ns/iter diff ns/iter diff % speedup
contiguous_sum_1e2 16 16 0 0.00% x 1.00
contiguous_sum_1e4 1,223 1,415 192 15.70% x 0.86
contiguous_sum_1e7 3,956,575 4,089,350 132,775 3.36% x 0.97
contiguous_sum_int_1e4 1,568 1,938 370 23.60% x 0.81
sum_by_col_1e4 297,202 290,586 -6,616 -2.23% x 1.02
sum_by_row_1e4 431,325 425,512 -5,813 -1.35% x 1.01
Results look quite different - there is little to no impact on performance. In particular, there seems to be no effect on the speed of integer summation.
I addressed all your observations @jturner314 :+1:
Hey @LukeMathWalker, it's been a while since I've commented on GitHub, and I just wanted to let you know that I'm not ignoring your PRs on ndarray and ndarray-stats; they look like great functionality. Life is just really busy, and I don't have time to review them right now. I'm hoping things will calm down by next weekend or so.
No need to justify yourself: all of us do OSS in our free time, and sometimes life just eats into that free time (or the will to do anything with it). No pressure on this side @jturner314!
I noticed one more issue -- in the non-contiguous case, the current .sum() implementation in this PR operates pairwise only over the last dimension (since it's naively adding up the pairwise sums of the lanes). I've created LukeMathWalker/ndarray#3 to fix this and make some additional simplifications and performance improvements.
The only other things I'd like to see are:
- A few more tests of .sum() / .sum_axis(). In particular, I'd like to see a test with more than 2 dimensions, a test with axis lengths that are larger than NAIVE_SUM_THRESHOLD * UNROLL_SIZE but not evenly divisible by NAIVE_SUM_THRESHOLD or UNROLL_SIZE, and the same for a discontiguous array.
- Some benchmarks with integer elements to see whether we need to specialize the implementation for integers.
Edit: I noticed that LukeMathWalker/ndarray#3 causes a performance regression in the middle_discontiguous_sum_ix3_1e2 benchmark since sum no longer takes advantage of pairwise_sum when the last axis is contiguous. I have a fix for this, but it needs some cleanup before I push it.
I looked at https://github.com/LukeMathWalker/ndarray/pull/3 and it seems good to me - I'll merge it, and in the meantime you can polish the edits required to fix the performance regression you observed. I'll work on getting more tests in there and some meaningful benchmarks with integers.
The only thing about https://github.com/LukeMathWalker/ndarray/pull/3 that I was slightly confused about is:
let cap = len.saturating_sub(1) / NAIVE_SUM_THRESHOLD + 1; // ceiling of division
Why are we subtracting 1 from len? :thinking:
Re-running benchmarks:
Desktop (AMD):
name master ns/iter branch ns/iter diff ns/iter diff % speedup
clip 1,730 1,658 -72 -4.16% x 1.04
contiguous_sum_1e2 16 14 -2 -12.50% x 1.14
contiguous_sum_1e4 1,277 1,479 202 15.82% x 0.86
contiguous_sum_1e7 4,057,990 4,187,331 129,341 3.19% x 0.97
contiguous_sum_int_1e2 17 17 0 0.00% x 1.00
contiguous_sum_int_1e4 1,619 2,013 394 24.34% x 0.80
contiguous_sum_int_1e7 4,993,451 5,037,513 44,062 0.88% x 0.99
contiguous_sum_int_ix3_1e2 326,134 333,641 7,507 2.30% x 0.98
contiguous_sum_ix3_1e2 283,887 289,471 5,584 1.97% x 0.98
inner_discontiguous_sum_int_ix3_1e2 809,916 1,164,653 354,737 43.80% x 0.70
inner_discontiguous_sum_ix3_1e2 809,432 1,014,721 205,289 25.36% x 0.80
middle_discontiguous_sum_int_ix3_1e2 1,344,886 2,116,789 771,903 57.40% x 0.64
middle_discontiguous_sum_ix3_1e2 1,072,775 1,987,436 914,661 85.26% x 0.54
sum_by_col_1e4 39,772,711 40,890,980 1,118,269 2.81% x 0.97
sum_by_col_int_1e4 51,297,923 51,720,879 422,956 0.82% x 0.99
sum_by_middle_1e2 659,063 682,702 23,639 3.59% x 0.97
sum_by_middle_int_1e2 645,698 663,506 17,808 2.76% x 0.97
sum_by_row_1e4 54,742,936 56,152,978 1,410,042 2.58% x 0.97
sum_by_row_int_1e4 42,806,355 44,406,731 1,600,376 3.74% x 0.96
Laptop (Intel):
name master ns/iter branch ns/iter diff ns/iter diff % speedup
clip 1,212 1,215 3 0.25% x 1.00
contiguous_sum_1e2 15 12 -3 -20.00% x 1.25
contiguous_sum_1e4 1,236 1,325 89 7.20% x 0.93
contiguous_sum_1e7 3,170,928 3,425,280 254,352 8.02% x 0.93
contiguous_sum_int_1e2 13 12 -1 -7.69% x 1.08
contiguous_sum_int_1e4 2,227 2,196 -31 -1.39% x 1.01
contiguous_sum_int_1e7 4,156,062 4,232,262 76,200 1.83% x 0.98
contiguous_sum_int_ix3_1e2 252,058 266,392 14,334 5.69% x 0.95
contiguous_sum_ix3_1e2 166,770 183,237 16,467 9.87% x 0.91
inner_discontiguous_sum_int_ix3_1e2 543,743 883,764 340,021 62.53% x 0.62
inner_discontiguous_sum_ix3_1e2 672,002 880,811 208,809 31.07% x 0.76
middle_discontiguous_sum_int_ix3_1e2 549,559 903,988 354,429 64.49% x 0.61
middle_discontiguous_sum_ix3_1e2 441,214 989,728 548,514 124.32% x 0.45
sum_by_col_1e4 31,920,821 34,300,217 2,379,396 7.45% x 0.93
sum_by_col_int_1e4 42,802,478 43,464,495 662,017 1.55% x 0.98
sum_by_middle_1e2 385,995 399,979 13,984 3.62% x 0.97
sum_by_middle_int_1e2 398,241 402,326 4,085 1.03% x 0.99
sum_by_row_1e4 43,487,940 45,690,264 2,202,324 5.06% x 0.95
sum_by_row_int_1e4 42,770,556 44,198,493 1,427,937 3.34% x 0.97
Okay, I've added my changes in LukeMathWalker/ndarray#4. They should improve performance on the middle_discontiguous benchmarks.
The only thing about LukeMathWalker#3 that I was slightly confused about is:
let cap = len.saturating_sub(1) / NAIVE_SUM_THRESHOLD + 1; // ceiling of division

Why are we subtracting 1 from len?
The goal of this line is to determine the ceiling of len / NAIVE_SUM_THRESHOLD (see Stack Overflow); we need the ceiling to account for the last few elements in the iterator when len is not evenly divisible by NAIVE_SUM_THRESHOLD. For example, with NAIVE_SUM_THRESHOLD = 512, len = 512 gives 511 / 512 + 1 = 1, while len = 513 gives 512 / 512 + 1 = 2, matching the ceiling in both cases. I've rewritten this line in LukeMathWalker/ndarray#4 to be clearer (and to give zero when len == 0):
let cap = len / NAIVE_SUM_THRESHOLD + if len % NAIVE_SUM_THRESHOLD != 0 { 1 } else { 0 };
The benchmarks are interesting, especially the difference between the two machines. All the results look favorable except for the discontiguous cases. I'm not really sure why they're so much worse. I've tried running perf for profiling but haven't had any success (perf report keeps giving an error message). I've tried a few ideas to improve iterator_pairwise_sum and pure_pairwise_sum but haven't had any success.
The new formulation for cap is much clearer, thanks!
I have rerun the benchmarks with your latest changes - the inner discontiguous case is the only one that is clearly suffering right now.
Desktop (AMD):
name master ns/iter branch ns/iter diff ns/iter diff % speedup
clip 1,651 1,650 -1 -0.06% x 1.00
contiguous_sum_1e2 16 13 -3 -18.75% x 1.23
contiguous_sum_1e4 1,224 1,385 161 13.15% x 0.88
contiguous_sum_1e7 3,869,448 4,103,742 234,294 6.05% x 0.94
contiguous_sum_int_1e2 17 15 -2 -11.76% x 1.13
contiguous_sum_int_1e4 1,568 1,931 363 23.15% x 0.81
contiguous_sum_int_1e7 4,876,993 5,036,355 159,362 3.27% x 0.97
contiguous_sum_int_ix3_1e2 332,453 341,591 9,138 2.75% x 0.97
contiguous_sum_ix3_1e2 274,877 293,840 18,963 6.90% x 0.94
inner_discontiguous_sum_int_ix3_1e2 772,709 1,116,399 343,690 44.48% x 0.69
inner_discontiguous_sum_ix3_1e2 762,707 1,013,607 250,900 32.90% x 0.75
middle_discontiguous_sum_int_ix3_1e2 1,162,295 1,425,353 263,058 22.63% x 0.82
middle_discontiguous_sum_ix3_1e2 934,716 1,136,989 202,273 21.64% x 0.82
sum_by_col_1e4 38,968,247 40,410,507 1,442,260 3.70% x 0.96
sum_by_col_int_1e4 50,080,514 51,510,736 1,430,222 2.86% x 0.97
sum_by_middle_1e2 703,151 662,579 -40,572 -5.77% x 1.06
sum_by_middle_int_1e2 637,496 644,015 6,519 1.02% x 0.99
sum_by_row_1e4 54,319,725 55,886,431 1,566,706 2.88% x 0.97
sum_by_row_int_1e4 42,122,728 43,521,191 1,398,463 3.32% x 0.97
(Intel benchmarks coming later)
What's the status of this PR?