sum() slower than custom summing function
Despite being specialized for NullableArray, sum is slower than a loop written by hand, both on current master and on my branch improving operations on Nullable. This is most likely due to branching and to some unexpected allocations. We should check whether the compiler is able to generate optimal code when the operators are called naively, or whether specializing reduce for NullableArray can still be useful.
On a recent Julia git master, with my operations branch:
julia> using NullableArrays
julia> using BenchmarkTools
julia> x = NullableArray(rand(10000000), rand(Bool, 10000000));
julia> @benchmark sum(x, skipnull=true)
BenchmarkTools.Trial:
samples: 48
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 3.00 mb
allocs estimate: 163823
minimum time: 102.94 ms (0.00% GC)
median time: 103.75 ms (0.00% GC)
mean time: 104.18 ms (0.38% GC)
maximum time: 108.24 ms (3.42% GC)
julia> function f{T}(x::NullableArray{T})
r = Nullable(zero(T))
@inbounds for el in x
!isnull(el) && (r += el)
end
r
end
f (generic function with 1 method)
julia> @benchmark f(x)
BenchmarkTools.Trial:
samples: 62
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 32.00 bytes
allocs estimate: 1
minimum time: 80.41 ms (0.00% GC)
median time: 80.96 ms (0.00% GC)
mean time: 80.98 ms (0.00% GC)
maximum time: 81.57 ms (0.00% GC)
Either this is pretty much fixed in the latest master and 0.6 or it's hardware specific. When I run your example, I get
julia> @benchmark sum(x, skipnull=true)
BenchmarkTools.Trial:
memory estimate: 3.00 MiB
allocs estimate: 163822
--------------
minimum time: 73.413 ms (0.00% GC)
median time: 74.191 ms (0.00% GC)
mean time: 74.510 ms (0.28% GC)
maximum time: 77.393 ms (0.00% GC)
--------------
samples: 68
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark f(x)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 1
--------------
minimum time: 71.422 ms (0.00% GC)
median time: 71.747 ms (0.00% GC)
mean time: 71.871 ms (0.00% GC)
maximum time: 75.335 ms (0.00% GC)
--------------
samples: 70
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
However, it is still as you said when the array consists mostly of non-missing values.
julia> bob = NullableArray(rand(10000000), [zeros(Bool, 10000000 - 1 ); true])
10000000-element NullableArrays.NullableArray{Float64,1}:
0.907998
0.0410982
0.869887
0.689121
0.347948
0.369682
0.428783
0.613532
0.490274
0.356395
⋮
0.846593
0.506439
0.645877
0.0634062
0.245713
0.644798
0.607075
0.108837
#NULL
julia> @benchmark sum(bob; skipnull = true)
BenchmarkTools.Trial:
memory estimate: 3.00 MiB
allocs estimate: 163822
--------------
minimum time: 34.577 ms (0.00% GC)
median time: 35.300 ms (0.00% GC)
mean time: 35.707 ms (0.69% GC)
maximum time: 42.227 ms (5.81% GC)
--------------
samples: 140
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark f(bob)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 1
--------------
minimum time: 26.046 ms (0.00% GC)
median time: 26.156 ms (0.00% GC)
mean time: 26.830 ms (0.00% GC)
maximum time: 33.703 ms (0.00% GC)
--------------
samples: 186
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
I also tried it with bob containing no missing values at all, and then sum is much faster than the loop. It must be calling a specialized method.
Looks like the timings improved in 0.6. I still see a relatively small difference on 0.5.
As you note, the strategy used by _mapreduce_skipnull is still generally less efficient than the manual loop when there are few missing values. It first goes over the input once just to check whether there are any nulls (missingdata), which is slow when no nulls appear among the first values. If there are nulls, it then counts them, which is also quite slow: we don't actually need the exact count, we only need to know whether there are at least one or two non-null values. Finally, it goes over the input again to compute the result in the general case. We should be able to replace these multiple null-checking passes with a single loop.
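A minimal sketch of what that single loop might look like, operating directly on the parallel values/isnull arrays that back a NullableArray (the function name and the tuple return are assumptions for illustration, not the actual _mapreduce_skipnull code; the real implementation would wrap the result in a Nullable and handle the empty case):

```julia
# Hypothetical single-pass skipnull sum: fuse the null check, the
# non-null count, and the accumulation into one loop, instead of
# separate passes for missingdata, countnull, and the reduction.
function sum_skipnull(values::Vector{T}, isnull::Vector{Bool}) where {T}
    s = zero(T)
    seen = 0                    # number of non-null entries reduced
    @inbounds for i in eachindex(values)
        if !isnull[i]
            s += values[i]
            seen += 1
        end
    end
    # The caller would return Nullable{T}() when seen == 0 and
    # Nullable(s) otherwise, matching sum(x, skipnull=true).
    return s, seen
end
```

Note that `seen` only needs to distinguish "zero non-null values" (and, for reductions like var, "fewer than two") from the rest, so a full count is more than required, but it falls out of the loop for free.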