sum() slower than custom summing function
Despite being specialized for NullableArray, sum is slower than a loop written by hand, both on current master and on my branch improving operations on Nullable. This is most likely due to branching and to some unexpected allocations. We should check whether the compiler is able to generate optimal code when the operators are called naively, or whether specializing reduce for NullableArray can still be useful.
On a recent Julia git master, with my operations branch:
julia> using NullableArrays
julia> using BenchmarkTools
julia> x = NullableArray(rand(10000000), rand(Bool, 10000000));
julia> @benchmark sum(x, skipnull=true)
BenchmarkTools.Trial:
samples: 48
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 3.00 mb
allocs estimate: 163823
minimum time: 102.94 ms (0.00% GC)
median time: 103.75 ms (0.00% GC)
mean time: 104.18 ms (0.38% GC)
maximum time: 108.24 ms (3.42% GC)
julia> function f{T}(x::NullableArray{T})
r = Nullable(zero(T))
@inbounds for el in x
!isnull(el) && (r += el)
end
r
end
f (generic function with 1 method)
julia> @benchmark f(x)
BenchmarkTools.Trial:
samples: 62
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 32.00 bytes
allocs estimate: 1
minimum time: 80.41 ms (0.00% GC)
median time: 80.96 ms (0.00% GC)
mean time: 80.98 ms (0.00% GC)
maximum time: 81.57 ms (0.00% GC)
Either this is pretty much fixed in the latest master and 0.6 or it's hardware specific. When I run your example, I get
julia> @benchmark sum(x, skipnull=true)
BenchmarkTools.Trial:
memory estimate: 3.00 MiB
allocs estimate: 163822
--------------
minimum time: 73.413 ms (0.00% GC)
median time: 74.191 ms (0.00% GC)
mean time: 74.510 ms (0.28% GC)
maximum time: 77.393 ms (0.00% GC)
--------------
samples: 68
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark f(x)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 1
--------------
minimum time: 71.422 ms (0.00% GC)
median time: 71.747 ms (0.00% GC)
mean time: 71.871 ms (0.00% GC)
maximum time: 75.335 ms (0.00% GC)
--------------
samples: 70
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
However, it is still as you said when the array consists mostly of non-missing values.
julia> bob = NullableArray(rand(10000000), [zeros(Bool, 10000000 - 1 ); true])
10000000-element NullableArrays.NullableArray{Float64,1}:
0.907998
0.0410982
0.869887
0.689121
0.347948
0.369682
0.428783
0.613532
0.490274
0.356395
⋮
0.846593
0.506439
0.645877
0.0634062
0.245713
0.644798
0.607075
0.108837
#NULL
julia> @benchmark sum(bob; skipnull = true)
BenchmarkTools.Trial:
memory estimate: 3.00 MiB
allocs estimate: 163822
--------------
minimum time: 34.577 ms (0.00% GC)
median time: 35.300 ms (0.00% GC)
mean time: 35.707 ms (0.69% GC)
maximum time: 42.227 ms (5.81% GC)
--------------
samples: 140
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark f(bob)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 1
--------------
minimum time: 26.046 ms (0.00% GC)
median time: 26.156 ms (0.00% GC)
mean time: 26.830 ms (0.00% GC)
maximum time: 33.703 ms (0.00% GC)
--------------
samples: 186
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
I also tried it with bob containing no missing values at all, and then sum is much faster than the loop. It must be calling a specialized method.
Looks like the timings improved in 0.6. I still see a relatively small difference on 0.5.
As you note, the strategy used by _mapreduce_skipnull is still generally less efficient than the manual loop when there are few missing values. It first goes over the input once just to check whether there are any nulls (missingdata), which is slow when no nulls appear among the first values. If there are nulls, it then counts them, which is also quite slow: we don't actually need the exact count, we only need to know whether there are at least one or two non-null values. Finally, it goes over the input again to compute the result in the general case. We should be able to replace these multiple null-checking passes with a single loop.
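A minimal sketch of what that single loop might look like, operating directly on the parallel values/isnull arrays that back a NullableArray (the function name and the tuple return are assumptions for illustration, not the actual _mapreduce_skipnull code; the real implementation would wrap the result in a Nullable and handle the empty case):

```julia
# Hypothetical single-pass skipnull sum: fuse the null check, the
# non-null count, and the accumulation into one loop, instead of
# separate passes for missingdata, countnull, and the reduction.
function sum_skipnull(values::Vector{T}, isnull::Vector{Bool}) where {T}
    s = zero(T)
    seen = 0                    # number of non-null entries reduced
    @inbounds for i in eachindex(values)
        if !isnull[i]
            s += values[i]
            seen += 1
        end
    end
    # The caller would return Nullable{T}() when seen == 0 and
    # Nullable(s) otherwise, matching sum(x, skipnull=true).
    return s, seen
end
```

Note that `seen` only needs to distinguish "zero non-null values" (and, for reductions like var, "fewer than two") from the rest, so a full count is more than required, but it falls out of the loop for free.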