Rasters.jl
Rasters.jl copied to clipboard
Innacurate results from `sum(skipmissing(raster))`
I was getting some mysterious results from summing over a raster of gridded population counts from WorldPop (which can be downloaded here)
This is a big raster, with Float32 data and a Float32 missingval. The output of sum(skipmissing(x)) is not identical to the sum after aggregating, or the sum of the sum of the slices, or the sum of collect(skipmissing(x))
In the worldpop data, the result is off by like 5 percent, but when I try to reproduce I can only get it off by like 0.1%. MWE:
n = 1000000
mval = -Inf32
ras = Raster(fill(mval, X(1:10000), Y(1:10000)); missingval = mval)
vals = exp.(rand(Float32, n) .* Float32(10))
ras[1:length(vals)] .= vals
sum(skipmissing(ras)) ≈ sum(vals) # false!
sum(skipmissing(replace_missing(ras))) ≈
sum(sum.(skipmissing.(eachslice(ras; dims = X)))) ≈
sum(vals) # true
What is going on here?
just to add to this a little bit more, I also noticed that doing (maybe a totally different issue, but I mention it because is also a sum 😄 ).
sum(resample(ras,...., method="sum")) and sum(ras) will give slightly different results.
Think of the sheer number of floating point operations happening in summing your 10_000 * 10_000 array. Theres a lot of room for error to creep in
There is a correctness optimisation to reduce the error in that process: https://en.wikipedia.org/wiki/Kahan_summation_algorithm
Maybe we don't hit that path with the Rasters inbuilt SkipMissingVal iterator. Or something like that.
There could be specialised methods of sum that we need to add? If someone could track down what the difference is, we can add them.
But seems we don't actually use kahan summation, but pairwise. And not on a regular iteration:
https://discourse.julialang.org/t/kahan-summation-in-sum/102723
Yes it only affects really big rasters, an error of more than 5% seemed crazy, but I also haven't been able to reproduce it.
Could you try reproducing this on the original data using AccurateArithmetic.jl? Or convert the values to Float64 before summing?
For me it was a fraction of a percent, but still big. But its a real problem, thus all the algorithms to fix it.
Try looking in Base for what sum does on SkipMissing as it seems to be better
Base has all of this machinery implemented for mapreduce on SkipMissing, which breaks the iterator up into chunks (of size 1024 by default). There actually isn't a specialization for sum.
https://github.com/JuliaLang/julia/blob/50713ee4a82eb1b5613647cd74b027315f665080/base/missing.jl#L273-L362
With Rasters.SkipMissing we hit the fallback, which is Base.mapfoldl_impl, and just adds all the values in a while loop.
It's quite a lot of machinery to add and maintain in Rasters.jl, though.
To clarify: with the example above, it's 0.1% off or so, with the population data it was way more, maybe because a few pixels have very high values. Anyway it seems important that these things are accurate.
Rasters.SkipMissing hits the same fallback as any generator, which there is already this issue about in base https://github.com/JuliaLang/julia/issues/30421
Hmm it's hard to do this with lazy iteration without essentially copying the Base implementation. One can always define Base.sum(sm::Rasters.SkipMissing) = Base.sum(collect(sm)), but that materializes sm which may not be desireable.
It also seems that sum on Base skipmissing has about the same performance as Rasters' internal skipmissingval, and generates the correct result as well. But you have to call replace_missing which is not ideal.
julia> @be sum($(Base.SkipMissing(replace_missing(ras))))
Benchmark: 3 samples with 1 evaluation
47.097 ms (5 allocs: 128 bytes)
47.157 ms (5 allocs: 128 bytes)
47.242 ms (5 allocs: 128 bytes)
julia> @be sum($(skipmissing(replace_missing(ras))))
Benchmark: 3 samples with 1 evaluation
46.943 ms (5 allocs: 128 bytes)
47.107 ms (5 allocs: 128 bytes)
47.122 ms (5 allocs: 128 bytes)
As far as I remember, in the cf branch, replace_missing will no longer be needed (@rafaqz?). If so it would probably be best to define Base.sum(sm::Rasters.SkipMissingVal) = Base.sum(Base.SkipMissing(sm.parent)), or maybe a broader dispatch on the mapreduce code.
What is the value of Rasters skipmissing (in a post-cf world) to your perspective @rafaqz @tiemvanderdeure?
I imagine a lot of people still won't use missing after that PR, like if you want Bool or NaN missingval because they're much faster in modelling. Really anything is faster than using missing... SkipMissingval is very fast for a lot of things (Your benchmark is timing the same thing twice there so we can't see that).
We cant collect, we just need to see what SkipMissing does in sum and add the method.
(We already fixed this in DiskArrays too, doing nested block sum. So it only applies to in memory data)
julia> @be sum($(skipmissing(replace_missing(ras))))
This method falls back to the base implementation. If a raster has missing as its missingval, then calling skipmissing on it returns a Base.SkipMissing, otherwise a Rasters.SkipMissing.
We cant collect, we just need to see what SkipMissing does in sum and add the method.
SkipMissing has a method for mapreduce, which is called by sum - it's in the lines I linked earlier. We could totally copy-paste that code and replace all the ismissing by _missing Of course it's not ideal, but I don't see any way to get around duplicating code.
I'm happy to duplicate.
But we would also need to do chunked DiskArrays handling inside that (will also be much faster as well as more correct on disk data).
KahanSummation.jl is also pretty light and I have a PR up that causes it to use mapreduce so presumably disklike things will be supported well. We can point users to that or default sum(x::Raster{<: AbstractFloat}) to point to sum_kbn...
I think the improvements in base are still on the way. As I understand everyone is eagerly waiting for https://github.com/JuliaLang/julia/pull/58418 and then a bunch of other PRs will get merged and everything will be better! 🌈
Yeah let's wait for that all to resolve, hopefully we don't have to do anything here