Rasters.jl icon indicating copy to clipboard operation
Rasters.jl copied to clipboard

Innacurate results from `sum(skipmissing(raster))`

Open tiemvanderdeure opened this issue 1 year ago • 15 comments

I was getting some mysterious results from summing over a raster of gridded population counts from WorldPop (which can be downloaded here)

This is a big raster, with Float32 data and a Float32 missingval. The output of sum(skipmissing(x)) is not identical to the sum after aggregating, or the sum of the sum of the slices, or the sum of collect(skipmissing(x))

In the worldpop data, the result is off by like 5 percent, but when I try to reproduce I can only get it off by like 0.1%. MWE:

n = 1000000
mval = -Inf32
ras = Raster(fill(mval, X(1:10000), Y(1:10000)); missingval = mval)
vals = exp.(rand(Float32, n) .* Float32(10))
ras[1:length(vals)] .= vals
sum(skipmissing(ras)) ≈ sum(vals) # false!
sum(skipmissing(replace_missing(ras))) ≈
sum(sum.(skipmissing.(eachslice(ras; dims = X)))) ≈
sum(vals) # true

What is going on here?

tiemvanderdeure avatar Nov 04 '24 10:11 tiemvanderdeure

just to add to this a little bit more, I also noticed that doing (maybe a totally different issue, but I mention it because is also a sum 😄 ).

sum(resample(ras,...., method="sum")) and sum(ras) will give slightly different results.

lazarusA avatar Nov 04 '24 13:11 lazarusA

Think of the sheer number of floating point operations happening in summing your 10_000 * 10_000 array. Theres a lot of room for error to creep in

There is a correctness optimisation to reduce the error in that process: https://en.wikipedia.org/wiki/Kahan_summation_algorithm

Maybe we don't hit that path with the Rasters inbuilt SkipMissingVal iterator. Or something like that.

There could be specialised methods of sum that we need to add? If someone could track down what the difference is, we can add them.

rafaqz avatar Nov 04 '24 14:11 rafaqz

But seems we don't actually use kahan summation, but pairwise. And not on a regular iteration:

https://discourse.julialang.org/t/kahan-summation-in-sum/102723

rafaqz avatar Nov 04 '24 14:11 rafaqz

Yes it only affects really big rasters, an error of more than 5% seemed crazy, but I also haven't been able to reproduce it.

tiemvanderdeure avatar Nov 04 '24 15:11 tiemvanderdeure

Could you try reproducing this on the original data using AccurateArithmetic.jl? Or convert the values to Float64 before summing?

asinghvi17 avatar Nov 04 '24 15:11 asinghvi17

For me it was a fraction of a percent, but still big. But its a real problem, thus all the algorithms to fix it.

Try looking in Base for what sum does on SkipMissing as it seems to be better

rafaqz avatar Nov 04 '24 15:11 rafaqz

Base has all of this machinery implemented for mapreduce on SkipMissing, which breaks the iterator up into chunks (of size 1024 by default). There actually isn't a specialization for sum. https://github.com/JuliaLang/julia/blob/50713ee4a82eb1b5613647cd74b027315f665080/base/missing.jl#L273-L362

With Rasters.SkipMissing we hit the fallback, which is Base.mapfoldl_impl, and just adds all the values in a while loop.

It's quite a lot of machinery to add and maintain in Rasters.jl, though.

To clarify: with the example above, it's 0.1% off or so, with the population data it was way more, maybe because a few pixels have very high values. Anyway it seems important that these things are accurate.

tiemvanderdeure avatar Nov 04 '24 16:11 tiemvanderdeure

Rasters.SkipMissing hits the same fallback as any generator, which there is already this issue about in base https://github.com/JuliaLang/julia/issues/30421

tiemvanderdeure avatar Nov 04 '24 16:11 tiemvanderdeure

Hmm it's hard to do this with lazy iteration without essentially copying the Base implementation. One can always define Base.sum(sm::Rasters.SkipMissing) = Base.sum(collect(sm)), but that materializes sm which may not be desireable.

It also seems that sum on Base skipmissing has about the same performance as Rasters' internal skipmissingval, and generates the correct result as well. But you have to call replace_missing which is not ideal.

julia> @be sum($(Base.SkipMissing(replace_missing(ras))))
Benchmark: 3 samples with 1 evaluation
        47.097 ms (5 allocs: 128 bytes)
        47.157 ms (5 allocs: 128 bytes)
        47.242 ms (5 allocs: 128 bytes)

julia> @be sum($(skipmissing(replace_missing(ras))))
Benchmark: 3 samples with 1 evaluation
        46.943 ms (5 allocs: 128 bytes)
        47.107 ms (5 allocs: 128 bytes)
        47.122 ms (5 allocs: 128 bytes)

As far as I remember, in the cf branch, replace_missing will no longer be needed (@rafaqz?). If so it would probably be best to define Base.sum(sm::Rasters.SkipMissingVal) = Base.sum(Base.SkipMissing(sm.parent)), or maybe a broader dispatch on the mapreduce code.

What is the value of Rasters skipmissing (in a post-cf world) to your perspective @rafaqz @tiemvanderdeure?

asinghvi17 avatar Nov 04 '24 19:11 asinghvi17

I imagine a lot of people still won't use missing after that PR, like if you want Bool or NaN missingval because they're much faster in modelling. Really anything is faster than using missing... SkipMissingval is very fast for a lot of things (Your benchmark is timing the same thing twice there so we can't see that).

We cant collect, we just need to see what SkipMissing does in sum and add the method.

(We already fixed this in DiskArrays too, doing nested block sum. So it only applies to in memory data)

rafaqz avatar Nov 04 '24 20:11 rafaqz

julia> @be sum($(skipmissing(replace_missing(ras))))

This method falls back to the base implementation. If a raster has missing as its missingval, then calling skipmissing on it returns a Base.SkipMissing, otherwise a Rasters.SkipMissing.

We cant collect, we just need to see what SkipMissing does in sum and add the method.

SkipMissing has a method for mapreduce, which is called by sum - it's in the lines I linked earlier. We could totally copy-paste that code and replace all the ismissing by _missing Of course it's not ideal, but I don't see any way to get around duplicating code.

tiemvanderdeure avatar Nov 04 '24 21:11 tiemvanderdeure

I'm happy to duplicate.

But we would also need to do chunked DiskArrays handling inside that (will also be much faster as well as more correct on disk data).

rafaqz avatar Nov 04 '24 22:11 rafaqz

KahanSummation.jl is also pretty light and I have a PR up that causes it to use mapreduce so presumably disklike things will be supported well. We can point users to that or default sum(x::Raster{<: AbstractFloat}) to point to sum_kbn...

asinghvi17 avatar May 10 '25 16:05 asinghvi17

I think the improvements in base are still on the way. As I understand everyone is eagerly waiting for https://github.com/JuliaLang/julia/pull/58418 and then a bunch of other PRs will get merged and everything will be better! 🌈

tiemvanderdeure avatar May 21 '25 10:05 tiemvanderdeure

Yeah let's wait for that all to resolve, hopefully we don't have to do anything here

rafaqz avatar May 21 '25 10:05 rafaqz