Pairwise summation
Motivation
The naive summation algorithm can cause significant precision loss when summing many floating point numbers - pairwise summation mitigates the issue without adding significant computational overhead.
Ideally, this would only be used when dealing with floats, but unfortunately we can't specialize on the argument type. Mitigating precision loss in floating point sums, on the other hand, is of paramount importance for the many algorithms that rely on summation as a building block - given that the overhead is minimal, I'd argue that the precision benefit is sufficient to adopt pairwise summation as the standard summation method.
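To make the idea concrete, here is a minimal sketch of the scheme (illustrative code, not the PR's implementation; the threshold value is arbitrary). A naive loop accumulates rounding error proportional to n, while the recursive split keeps it proportional to log n:

fn pairwise_sum(xs: &[f64]) -> f64 {
    const THRESHOLD: usize = 64; // illustrative base-case size
    if xs.len() <= THRESHOLD {
        // naive summation is fine (and fast) for short slices
        xs.iter().sum()
    } else {
        // split in half, sum each half recursively, add the partial sums
        let (lo, hi) = xs.split_at(xs.len() / 2);
        pairwise_sum(lo) + pairwise_sum(hi)
    }
}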
Public API
Nothing changes.
Technicalities
I haven't modified the overall structure of sum and sum_axis: I have just provided some helper functions that isolate the pairwise summation algorithm.
The function that sums slices leverages the available unrolled_fold to vectorise the sum once the slice length falls below the size threshold.
The function that sums iterators falls back on the slice summation function as soon as the first pass is over, to recover vectorisation given that the array of partial sums is contiguous in memory (only relevant for arrays with more than 512^2 elements).
The function that sums an iterator of arrays is self-recursive.
It would probably be possible to share the pagination logic (to sum in blocks of 512) between the two functions operating on iterators, but I haven't been able to do it so far.
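As a rough sketch of the structure just described (illustrative, assuming f64 elements and the 512-element blocks mentioned above; the real helpers are generic): the iterator is consumed in fixed-size blocks, each block is summed naively, and the contiguous vector of partial sums is then handed to the slice-based pairwise function, which is where vectorisation is recovered.

fn iterator_pairwise_sum<I>(iter: I) -> f64
where
    I: Iterator<Item = f64>,
{
    const BLOCK: usize = 512;
    let mut partial_sums = Vec::new();
    let mut block_sum = 0.0;
    let mut count = 0;
    for x in iter {
        block_sum += x;
        count += 1;
        if count == BLOCK {
            partial_sums.push(block_sum);
            block_sum = 0.0;
            count = 0;
        }
    }
    if count > 0 {
        partial_sums.push(block_sum); // leftover partial block
    }
    // contiguous in memory, so this second pass vectorises again;
    // pairwise_sum here is a slice-based helper like the sketch above
    pairwise_sum(&partial_sums)
}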
Nice. It's not obvious to me whether we should use the current implementation, this one, or another compensated summation algorithm in the default sums.
1. We can specialize on the argument type - we only need to take the step of requiring 'static on the scalar type to be able to use type id / "static" / "simple and stupid" specialization, which is what for example matrix multiplication does. I think we should take this step. (In future Rust we hope that type id based specialization is usable without this bound.)
2. This is a scalable solution. Unrolled fold is already using a split into 8 subfactors, which should improve precision, but it's only used when we have contiguous input of course.
3. Not complete without benchmarks of contiguous and other cases.
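For reference, point 1 boils down to a runtime type-id check that monomorphization turns into a compile-time constant. A minimal sketch (the helper and function names here are illustrative, not ndarray's actual API):

use std::any::TypeId;
use num_traits::Zero;

// Requires 'static so that TypeId is available for the scalar type.
fn same_type<A: 'static, B: 'static>() -> bool {
    TypeId::of::<A>() == TypeId::of::<B>()
}

fn sum_maybe_specialized<A>(xs: &[A]) -> A
where
    A: 'static + Clone + Zero,
{
    if same_type::<A, f64>() {
        // dispatch to an f64-specific (e.g. pairwise/SIMD) path here;
        // in practice this needs an unsafe cast from &[A] to &[f64]
    }
    // generic fallback
    xs.iter().cloned().fold(A::zero(), |acc, x| acc + x)
}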
The only other compensated summation algorithm that comes to mind is Kahan summation, but its performance overhead is not negligible, so I wouldn't provide it as the default implementation.
Alternatively, we could provide different sum methods and let the user choose what they need: this is what Julia does, using pairwise summation in its default sum method (here) while providing a variant of Kahan summation as an additional method - here. NumPy instead opted to provide pairwise summation as the default implementation without an additional method for Kahan summation - here.
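For context, a textbook sketch of Kahan's algorithm, which shows where the overhead comes from - roughly four dependent floating-point operations per element instead of one, which also prevents the loop from being vectorised:

fn kahan_sum(xs: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut c = 0.0; // running compensation for lost low-order bits
    for &x in xs {
        let y = x - c;     // apply the compensation
        let t = sum + y;   // low-order bits of y may be lost here
        c = (t - sum) - y; // reconstruct what was lost
        sum = t;
    }
    sum
}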
Re 1. - Wouldn't using TypeId + 'static enable this implementation only for the float primitives, f32 and f64? There are structs that implement the Float trait while encapsulating a float primitive to provide additional invariants (e.g. N64/N32 and R64/R32 in noisy_float - docs) that would not be specialized, unless we can specialize on types that implement the Float trait, which as far as I understand is not currently possible.
Alternatively, we could have a list of "common" float wrappers and specialize for all of them, but it doesn't sound very scalable.
Re 3. - Ok, I'll work on benchmarks and add them to the PR.
Running these benchmarks I get:
name master ns/iter branch ns/iter diff ns/iter diff % speedup
contiguous_sum_1e2 16 16 0 0.00% x 1.00
contiguous_sum_1e4 1,223 1,415 192 15.70% x 0.86
contiguous_sum_1e7 3,956,575 4,089,350 132,775 3.36% x 0.97
contiguous_sum_int_1e4 1,568 1,938 370 23.60% x 0.81
sum_by_col_1e4 297,202 290,586 -6,616 -2.23% x 1.02
sum_by_row_1e4 431,325 425,512 -5,813 -1.35% x 1.01
Never done serious benchmarking before, so take these results with a grain of salt.
@LukeMathWalker Thanks for working on this! Pairwise summation is better than the current implementation.
I would hope we'd be able to use pairwise summation without significant performance cost. I'm surprised that the differences in some of those benchmarks are as large as they are.
A few thoughts:
- I think we should use pairwise summation as the implementation of .sum(). It's the default for NumPy and Julia, and it seems like a clear win if the performance cost is small.
- We can specialize on integer types to avoid pairwise summation for them if that turns out to provide a significant performance benefit. However, the general case should be pairwise, because as @LukeMathWalker pointed out, there are many types other than f32 and f64 that need the pairwise behavior for accuracy.
- It's unsatisfying to add an array_pairwise_sum function just for sum_axis. How about something like the example below instead? I would hope the performance would be similar.

  pub fn sum_axis(&self, axis: Axis) -> Array<A, D::Smaller>
  where
      A: Clone + Zero + Add<Output = A>,
      D: RemoveAxis,
  {
      let mut out = Array::zeros(self.dim.remove_axis(axis));
      Zip::from(&mut out)
          .and(self.lanes(axis))
          .apply(|out, lane| *out = lane.sum());
      out
  }

- We should make sure that pairwise_sum and iterator_pairwise_sum add up the same number of values in the base case. In the current version of this PR, they don't, since pairwise_sum calls unrolled_fold, which performs an eightfold unrolled sum. To make sure pairwise_sum and iterator_pairwise_sum add up the same number of values in the base case, iterator_pairwise_sum needs to divide NAIVE_SUM_THRESHOLD by 8.
- It would be nice for .product() to be pairwise too. We could add a .pairwise_reduce() method for ArrayBase that would assume that the operation f is associative and operate in a pairwise way, and then use that method to implement both .sum() and .product(). Assuming the compiler inlines sufficiently aggressively, this shouldn't have a performance cost. Something like the signature below (a fuller sketch of the body follows this list):

  impl<A, S, D> ArrayBase<S, D>
  where
      S: RawData<Elem = A>,
      D: Dimension,
  {
      fn pairwise_reduce<F>(&self, base: usize, id: A, f: F) -> A
      where
          F: Fn(A, A) -> A,
          A: Clone,
      {
          unimplemented!() // signature sketch only
      }
  }
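For concreteness, here is one way the pairwise_reduce body could look, written as a free function over a slice (illustrative only; the real method would also have to handle discontiguous arrays, and it assumes base >= 1 and an associative f):

fn pairwise_reduce<A, F>(xs: &[A], base: usize, id: A, f: &F) -> A
where
    A: Clone,
    F: Fn(A, A) -> A,
{
    if xs.len() <= base {
        // base case: naive fold over a short slice
        xs.iter().fold(id, |acc, x| f(acc, x.clone()))
    } else {
        // split in half and combine the two reductions with f
        let (lo, hi) = xs.split_at(xs.len() / 2);
        f(
            pairwise_reduce(lo, base, id.clone(), f),
            pairwise_reduce(hi, base, id, f),
        )
    }
}

With something like this, .sum() becomes pairwise_reduce(..., A::zero(), &|a, b| a + b) and .product() becomes pairwise_reduce(..., A::one(), &|a, b| a * b).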
We can specialize on integer types to avoid pairwise summation for them if that turns out to provide a significant performance benefit. However, the general case should be pairwise, because as @LukeMathWalker pointed out, there are many types other than f32 and f64 that need the pairwise behavior for accuracy.
We will still fall short on new types wrapping integer values, but this seems like a better solution. If we plan on specializing, we might also take advantage of the fact that, depending on the byte size, we might get more integers packed together using SIMD compared to floats. But I guess this falls under the broader discussion on how to leverage SIMD instructions in ndarray.
It's unsatisfying to add an array_pairwise_sum function just for sum_axis. How about something like the example below instead? I would hope the performance would be similar.
I tried it out and this is the performance I get:
name master ns/iter branch ns/iter diff ns/iter diff % speedup
sum_by_col_1e4 287,381 303,410 16,029 5.58% x 0.95
sum_by_row_1e4 420,978 1,019,645 598,667 142.21% x 0.41
We should make sure that pairwise_sum and iterator_pairwise_sum add up the same number of values in the base case. In the current version of this PR, they don't, since pairwise_sum calls unrolled_fold which performs an eightfold unrolled sum. To make sure pairwise_sum and iterator_pairwise_sum add up the same number of values in the base case, iterator_pairwise_sum needs to divide NAIVE_SUM_THRESHOLD by 8.
Good catch - I'll fix it.
It would be nice for .product() to be pairwise too.
Would it really be beneficial? When doing long products what I usually fear is overflowing/underflowing. To get around that, I use the exp-ln trick, as in https://github.com/jturner314/ndarray-stats/pull/20 for geometric_mean.
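For the record, the exp-ln trick maps the product to a sum of logarithms and back, so the intermediate values stay in a manageable range. A sketch, assuming strictly positive inputs:

fn stable_product(xs: &[f64]) -> f64 {
    // ln turns the product into a sum, which is far less prone to
    // overflow/underflow; exp maps the result back. Note that the
    // empty product correctly comes out as exp(0) = 1.
    xs.iter().map(|x| x.ln()).sum::<f64>().exp()
}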
I took another look at .sum_axis() and decided to benchmark this time instead of just guessing. :) This has slightly better performance on my machine than the PR for the sum_by_col_1e4 and sum_by_row_1e4 benchmarks and is more concise:
pub fn sum_axis(&self, axis: Axis) -> Array<A, D::Smaller>
where
    A: Clone + Zero + Add<Output = A>,
    D: RemoveAxis,
{
    let n = self.len_of(axis);
    let stride = self.strides()[axis.index()];
    if self.ndim() == 2 && stride == 1 {
        // contiguous along the axis we are summing
        let mut res = Array::zeros(self.raw_dim().remove_axis(axis));
        let ax = axis.index();
        for (i, elt) in enumerate(&mut res) {
            *elt = self.index_axis(Axis(1 - ax), i).sum();
        }
        res
    } else if n <= numeric_util::NO_SIMD_NAIVE_SUM_THRESHOLD {
        // short axis: a naive fold is accurate enough
        self.fold_axis(axis, A::zero(), |acc, x| acc.clone() + x.clone())
    } else {
        // long axis: split along it and combine the two halves pairwise
        let (v1, v2) = self.view().split_at(axis, n / 2);
        v1.sum_axis(axis) + v2.sum_axis(axis)
    }
}
To make sure pairwise_sum and iterator_pairwise_sum add up the same number of values in the base case, iterator_pairwise_sum needs to divide NAIVE_SUM_THRESHOLD by 8.
Actually, there's another issue I didn't catch earlier. In the base case, pairwise_sum adds 64 values, and then for higher levels, adds pairs of sums. In contrast, in the base case, iterator_pairwise_sum adds 64 values (computing partial_sums), then in the next higher level adds groups of 64 sums (in the pairwise_sum call), and then for the higher levels adds pairs of sums. In other words, iterator_pairwise_sum has two levels of summing groups of 64 values, not just the base level, so it needs to be changed to match pairwise_sum.
I also noticed that we technically need to implement unrolled_fold pairwise with something like this:
pub fn unrolled_fold<A, I, F>(mut xs: &[A], init: I, f: F) -> A
where
    A: Clone,
    I: Fn() -> A,
    F: Fn(A, A) -> A,
{
    // eightfold unrolled so that floating point can be vectorized
    // (even with strict floating point accuracy semantics)
    let (mut p0, mut p1, mut p2, mut p3, mut p4, mut p5, mut p6, mut p7) =
        (init(), init(), init(), init(), init(), init(), init(), init());
    while xs.len() >= 8 {
        p0 = f(p0, xs[0].clone());
        p1 = f(p1, xs[1].clone());
        p2 = f(p2, xs[2].clone());
        p3 = f(p3, xs[3].clone());
        p4 = f(p4, xs[4].clone());
        p5 = f(p5, xs[5].clone());
        p6 = f(p6, xs[6].clone());
        p7 = f(p7, xs[7].clone());
        xs = &xs[8..];
    }
    // combine the eight accumulators as a tree instead of a chain,
    // so the reduction of the accumulators is itself pairwise
    let (q0, q1, q2, q3) = (f(p0, p4), f(p1, p5), f(p2, p6), f(p3, p7));
    let (r0, r1) = (f(q0, q2), f(q1, q3));
    let unrolled = f(r0, r1);
    // make it clear to the optimizer that this loop is short
    // and can not be autovectorized.
    let mut partial = init();
    for i in 0..xs.len() {
        if i >= 7 {
            break;
        }
        partial = f(partial.clone(), xs[i].clone())
    }
    f(unrolled, partial)
}
The performance difference is within the +/- bounds of the benchmarks.
By the way, after more thought, I'd prefer to rename NO_SIMD_NAIVE_SUM_THRESHOLD to NAIVE_SUM_THRESHOLD because that's really the base case, and add a constant UNROLL_ACCUMULATORS or UNROLL_DEGREE of 8. Then, the condition for pairwise_sum would be n <= NAIVE_SUM_THRESHOLD * UNROLL_ACCUMULATORS.
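Concretely, the renamed constants and the new condition would read something like this (the threshold value is illustrative, and unrolled_fold is the function shown above):

const NAIVE_SUM_THRESHOLD: usize = 64; // base-case naive-sum length
const UNROLL_ACCUMULATORS: usize = 8; // accumulators used by unrolled_fold

fn pairwise_sum(xs: &[f64]) -> f64 {
    if xs.len() <= NAIVE_SUM_THRESHOLD * UNROLL_ACCUMULATORS {
        // base case: a single unrolled_fold pass, eight accumulators wide
        unrolled_fold(xs, || 0.0, |a, b| a + b)
    } else {
        let (lo, hi) = xs.split_at(xs.len() / 2);
        pairwise_sum(lo) + pairwise_sum(hi)
    }
}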
It would be nice for .product() to be pairwise too.
Would it really be beneficial?
I assumed that floating-point multiplication would have the same issue with error accumulation as addition, but I don't really know. We can keep .product() as-is for now.
Ok, I'll make the required changes to have a uniform behaviour.
One observation: out of curiosity, I have repeated the benchmarks on a different machine (my desktop has an AMD Ryzen 1700, while this time I am running them on my office laptop, which has an Intel i9-8950HK).
Laptop (Intel):
name master ns/iter branch ns/iter diff ns/iter diff % speedup
contiguous_sum_1e2 16 16 0 0.00% x 1.00
contiguous_sum_1e4 1,355 1,435 80 5.90% x 0.94
contiguous_sum_1e7 3,433,483 3,489,613 56,130 1.63% x 0.98
contiguous_sum_int_1e4 2,338 2,294 -44 -1.88% x 1.02
sum_by_col_1e4 213,243 204,285 -8,958 -4.20% x 1.04
sum_by_row_1e4 326,246 326,854 608 0.19% x 1.00
Desktop (AMD): [for ease of comparison]
name master ns/iter branch ns/iter diff ns/iter diff % speedup
contiguous_sum_1e2 16 16 0 0.00% x 1.00
contiguous_sum_1e4 1,223 1,415 192 15.70% x 0.86
contiguous_sum_1e7 3,956,575 4,089,350 132,775 3.36% x 0.97
contiguous_sum_int_1e4 1,568 1,938 370 23.60% x 0.81
sum_by_col_1e4 297,202 290,586 -6,616 -2.23% x 1.02
sum_by_row_1e4 431,325 425,512 -5,813 -1.35% x 1.01
Results look quite different - there is little to no impact on performance. In particular, there seems to be no effect on the speed of integer summation.
I addressed all your observations @jturner314 :+1:
Hey @LukeMathWalker, it's been a while since I've commented on GitHub, and I just wanted to let you know that I'm not ignoring your PRs on ndarray and ndarray-stats; they look like great functionality. Life is just really busy, and I don't have time to review them right now. I'm hoping things will calm down by next weekend or so.
No need to justify yourself: all of us do OSS in our free time, and sometimes life just eats into that free time (or the will to do anything with it). No pressure on this side @jturner314!
I noticed one more issue -- in the non-contiguous case, the current .sum() implementation in this PR operates pairwise only over the last dimension (since it's naively adding up the pairwise sums of the lanes). I've created LukeMathWalker/ndarray#3 to fix this and make some additional simplifications and performance improvements.
The only other things I'd like to see are:
- A few more tests of .sum() / .sum_axis(). In particular, I'd like to see a test with more than 2 dimensions, a test with axis lengths that are larger than NAIVE_SUM_THRESHOLD * UNROLL_SIZE but not evenly divisible by NAIVE_SUM_THRESHOLD or UNROLL_SIZE, and the same for a discontiguous array.
- Some benchmarks with integer elements to see whether we need to specialize the implementation for integers.
Edit: I noticed that LukeMathWalker/ndarray#3 causes a performance regression in the middle_discontiguous_sum_ix3_1e2 benchmark since sum no longer takes advantage of pairwise_sum when the last axis is contiguous. I have a fix for this, but it needs some cleanup before I push it.
I looked at https://github.com/LukeMathWalker/ndarray/pull/3 and it seems good to me - I'll merge it, and in the meantime you can polish the edits required to fix the performance regression you observed. I'll work on getting more tests in there and some meaningful benchmarks with integers.
The only thing about https://github.com/LukeMathWalker/ndarray/pull/3 that I was slightly confused about is:
let cap = len.saturating_sub(1) / NAIVE_SUM_THRESHOLD + 1; // ceiling of division
Why are we subtracting 1 from len? :thinking:
Re-running benchmarks:
Desktop (AMD):
name master ns/iter branch ns/iter diff ns/iter diff % speedup
clip 1,730 1,658 -72 -4.16% x 1.04
contiguous_sum_1e2 16 14 -2 -12.50% x 1.14
contiguous_sum_1e4 1,277 1,479 202 15.82% x 0.86
contiguous_sum_1e7 4,057,990 4,187,331 129,341 3.19% x 0.97
contiguous_sum_int_1e2 17 17 0 0.00% x 1.00
contiguous_sum_int_1e4 1,619 2,013 394 24.34% x 0.80
contiguous_sum_int_1e7 4,993,451 5,037,513 44,062 0.88% x 0.99
contiguous_sum_int_ix3_1e2 326,134 333,641 7,507 2.30% x 0.98
contiguous_sum_ix3_1e2 283,887 289,471 5,584 1.97% x 0.98
inner_discontiguous_sum_int_ix3_1e2 809,916 1,164,653 354,737 43.80% x 0.70
inner_discontiguous_sum_ix3_1e2 809,432 1,014,721 205,289 25.36% x 0.80
middle_discontiguous_sum_int_ix3_1e2 1,344,886 2,116,789 771,903 57.40% x 0.64
middle_discontiguous_sum_ix3_1e2 1,072,775 1,987,436 914,661 85.26% x 0.54
sum_by_col_1e4 39,772,711 40,890,980 1,118,269 2.81% x 0.97
sum_by_col_int_1e4 51,297,923 51,720,879 422,956 0.82% x 0.99
sum_by_middle_1e2 659,063 682,702 23,639 3.59% x 0.97
sum_by_middle_int_1e2 645,698 663,506 17,808 2.76% x 0.97
sum_by_row_1e4 54,742,936 56,152,978 1,410,042 2.58% x 0.97
sum_by_row_int_1e4 42,806,355 44,406,731 1,600,376 3.74% x 0.96
Laptop (Intel):
name master ns/iter branch ns/iter diff ns/iter diff % speedup
clip 1,212 1,215 3 0.25% x 1.00
contiguous_sum_1e2 15 12 -3 -20.00% x 1.25
contiguous_sum_1e4 1,236 1,325 89 7.20% x 0.93
contiguous_sum_1e7 3,170,928 3,425,280 254,352 8.02% x 0.93
contiguous_sum_int_1e2 13 12 -1 -7.69% x 1.08
contiguous_sum_int_1e4 2,227 2,196 -31 -1.39% x 1.01
contiguous_sum_int_1e7 4,156,062 4,232,262 76,200 1.83% x 0.98
contiguous_sum_int_ix3_1e2 252,058 266,392 14,334 5.69% x 0.95
contiguous_sum_ix3_1e2 166,770 183,237 16,467 9.87% x 0.91
inner_discontiguous_sum_int_ix3_1e2 543,743 883,764 340,021 62.53% x 0.62
inner_discontiguous_sum_ix3_1e2 672,002 880,811 208,809 31.07% x 0.76
middle_discontiguous_sum_int_ix3_1e2 549,559 903,988 354,429 64.49% x 0.61
middle_discontiguous_sum_ix3_1e2 441,214 989,728 548,514 124.32% x 0.45
sum_by_col_1e4 31,920,821 34,300,217 2,379,396 7.45% x 0.93
sum_by_col_int_1e4 42,802,478 43,464,495 662,017 1.55% x 0.98
sum_by_middle_1e2 385,995 399,979 13,984 3.62% x 0.97
sum_by_middle_int_1e2 398,241 402,326 4,085 1.03% x 0.99
sum_by_row_1e4 43,487,940 45,690,264 2,202,324 5.06% x 0.95
sum_by_row_int_1e4 42,770,556 44,198,493 1,427,937 3.34% x 0.97
Okay, I've added my changes in LukeMathWalker/ndarray#4. They should improve performance on the middle_discontiguous benchmarks.
The only thing about LukeMathWalker#3 that I was slightly confused about is:
let cap = len.saturating_sub(1) / NAIVE_SUM_THRESHOLD + 1; // ceiling of division

Why are we subtracting 1 from len?
The goal of this line is to determine the ceiling of len / NAIVE_SUM_THRESHOLD (see Stack Overflow); we need the ceiling to account for the last few elements in the iterator when len is not evenly divisible by NAIVE_SUM_THRESHOLD. For example, with NAIVE_SUM_THRESHOLD = 512, len = 512 gives 511 / 512 + 1 = 1, while len = 513 gives 512 / 512 + 1 = 2, matching the ceiling in both cases. I've rewritten this line in LukeMathWalker/ndarray#4 to be clearer (and to give zero when len == 0):
let cap = len / NAIVE_SUM_THRESHOLD + if len % NAIVE_SUM_THRESHOLD != 0 { 1 } else { 0 };
The benchmarks are interesting, especially the difference between the two machines. All the results look favorable except for the discontiguous cases. I'm not really sure why they're so much worse. I've tried running perf for profiling but haven't had any success (perf report keeps giving an error message). I've tried a few ideas to improve iterator_pairwise_sum and pure_pairwise_sum but haven't had any success.
The new formulation for cap is much clearer, thanks!
I have rerun the benchmarks with your latest changes - the inner discontiguous case is the only one that is clearly suffering right now.
Desktop (AMD):
name master ns/iter branch ns/iter diff ns/iter diff % speedup
clip 1,651 1,650 -1 -0.06% x 1.00
contiguous_sum_1e2 16 13 -3 -18.75% x 1.23
contiguous_sum_1e4 1,224 1,385 161 13.15% x 0.88
contiguous_sum_1e7 3,869,448 4,103,742 234,294 6.05% x 0.94
contiguous_sum_int_1e2 17 15 -2 -11.76% x 1.13
contiguous_sum_int_1e4 1,568 1,931 363 23.15% x 0.81
contiguous_sum_int_1e7 4,876,993 5,036,355 159,362 3.27% x 0.97
contiguous_sum_int_ix3_1e2 332,453 341,591 9,138 2.75% x 0.97
contiguous_sum_ix3_1e2 274,877 293,840 18,963 6.90% x 0.94
inner_discontiguous_sum_int_ix3_1e2 772,709 1,116,399 343,690 44.48% x 0.69
inner_discontiguous_sum_ix3_1e2 762,707 1,013,607 250,900 32.90% x 0.75
middle_discontiguous_sum_int_ix3_1e2 1,162,295 1,425,353 263,058 22.63% x 0.82
middle_discontiguous_sum_ix3_1e2 934,716 1,136,989 202,273 21.64% x 0.82
sum_by_col_1e4 38,968,247 40,410,507 1,442,260 3.70% x 0.96
sum_by_col_int_1e4 50,080,514 51,510,736 1,430,222 2.86% x 0.97
sum_by_middle_1e2 703,151 662,579 -40,572 -5.77% x 1.06
sum_by_middle_int_1e2 637,496 644,015 6,519 1.02% x 0.99
sum_by_row_1e4 54,319,725 55,886,431 1,566,706 2.88% x 0.97
sum_by_row_int_1e4 42,122,728 43,521,191 1,398,463 3.32% x 0.97
(Intel benchmarks coming later)
What's the status of this PR?