raster-benchmark
rasters_jl issues
Extent of the input rasters (xmax and ymax are different):
In terra:
ext(ras)
#> SpatExtent : 550785, 783915, 5611185, 5847315 (xmin, xmax, ymin, ymax)
In rasters_jl:
extent(rast)
#> Extent(X = (550785.0, 783885.0), Y = (5.611215e6, 5.847315e6), Band = (:B1, :B9))
In crop.jl:
rast = Raster(RasterStack(raster_files; name=band_names, lazy=true))
x = crop(rast; to=ext)
extent(x)
#> Extent(X = (598515.0, 727485.0), Y = (5.682105e6, 5.780985e6))
xmax and ymax are different; does Rasters.jl use a different snapping option?
Extent in terra:
x = crop(ras, area, snap = "near")
ext(x)
SpatExtent : 598515, 727515, 5682105, 5781015 (xmin, xmax, ymin, ymax)
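The one-cell (30 m) difference in xmax and ymax is consistent with different snapping rules. As an illustration of the arithmetic only (request_xmax here is a hypothetical requested value between two cell edges, not taken from the benchmark), snapping "near" versus snapping inward lands on exactly the two xmax values reported above:

```julia
# Sketch: how "near" vs inward (floor) snapping produce a one-cell difference.
# The grid is anchored at xmin = 550785 with 30 m resolution (from the data above);
# request_xmax is a hypothetical value between two cell edges.
xmin0, res = 550785.0, 30.0
request_xmax = 727510.0

# snap to the nearest cell edge (like terra's snap = "near")
snapped_near = xmin0 + round((request_xmax - xmin0) / res) * res   # 727515.0

# snap inward to the cell edge at or below the request
snapped_floor = xmin0 + floor((request_xmax - xmin0) / res) * res  # 727485.0
```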
- The cropped raster should be a deep copy in Rasters.jl.
In terra:
x = aggregate(ras, fact = 3, fun = "mean")
global(x[[1]], "mean", na.rm = TRUE)
#> 9767.289
x[1000,1000,1]
#> 10125.44
In rasters_jl:
x = Rasters.aggregate(mean, raster, (3))
mean(skipmissing(x[:,:,1]))
#> 2484.44657576314
x[1000,1000,1]
#> 2843.6666666666665
The number of rows and columns is also different.
extract-points.jl: Everything is missing. Output should have values.
@NamedTuple{geometry::Union{Missing, Tuple{Float64, Float64}}, B1::Union{Missing, UInt16}, B10::Union{Missing, UInt16}, B11::Union{Missing, UInt16}, B2::Union{Missing, UInt16}, B3::Union{Missing, UInt16}, B4::Union{Missing, UInt16}, B5::Union{Missing, UInt16}, B6::Union{Missing, UInt16}, B7::Union{Missing, UInt16}, B9::Union{Missing, UInt16}}(((668900.0, 5.69258e6), missing, missing, missing, missing, missing, missing, missing, missing, missing, missing))
In terra:
x = lapp(ras, fun = ndvi)
global(x, mean, na.rm = TRUE)
#> 0.285815
In rasters_jl:
x = get_ndvi(red, nir)
mean(skipmissing(x))
#> 0.331073887918443
In write.jl NoData Value should be set to 0, not 65535.
In terra the output is saved to a temporary file.
Why are the results in Rasters.jl different?
terra::extract(ras, buffers, fun = mean, exact = FALSE)[, -1]
#> B1 B10 B11 B2 B3 B4 B5 B6 B7 B9
#> 1 9232.815 24792.09 23200.88 8417.715 7710.402 7387.806 11909.14 10570.8 8245.231 5051.213
terra::extract(ras, buffers, fun = mean, exact = TRUE)[, -1]
#> B1 B10 B11 B2 B3 B4 B5 B6 B7 B9
#> 1 9234.606 24794.23 23202.35 8420.676 7714.726 7395.761 11913.44 10587.64 8257.126 5051.198
exactextractr::exact_extract(ras, buffers, fun = "mean")
#> B1 B10 B11 B2 B3 B4 B5 B6 B7 B9
#> 1 9234.605 24794.23 23202.35 8420.676 7714.726 7395.76 11913.44 10587.64 8257.126 5051.198
zonal((Statistics.mean), rstack; of=(buffer_df.geom), progress=false)
#> (B1 = 9242.757575757576, B10 = 24795.405594405594, B11 = 23204.291375291374, B2 = 8432.946386946387, B3 = 7730.361305361305, B4 = 7427.223776223776, B5 = 11921.039627039627, B6 = 10657.249417249417, B7 = 8306.972027972028, B9 = 5051.146853146854)
Hey @asinghvi17 and @rafaqz! I ran the scripts in rasters_jl and compared the output with terra and noticed a lot of differences. Could you please check this and respond?
Thanks for the ping. I can't run Terra on my Mac (for some reason it doesn't install) so it is great to have the results to compare to here.
For crop, what are you trying to measure? If it's data loading time, then let's call this something else maybe?
For the dataframe comments, julia accepts any vector of named tuples as a table, so this is completely compatible with the interface.
The differences in results are definitely something we should dig into. I'll try to replicate those tomorrow.
Thanks!
For crop, what are you trying to measure? If it's data loading time, then let's call this something else maybe?
I actually wrote "load", but what I mean is that after raster cropping, the output object should be a new object in memory, like a deep copy (not an in-place operation), to make it comparable with the other packages.
For the dataframe comments, julia accepts any vector of named tuples as a table, so this is completely compatible with the interface.
I mean that the output should be displayed as a table/data frame (rows and columns).
Thanks, I will need to look through the files to see why they are different for aggregate. The extent looks like a points/intervals disagreement. ndvi may just be an incorrect implementation in the benchmark.
We can set missingval=0 in write to make that the same.
And we can add a materialisation step after crop if you want; it will at least stop that benchmark from ruining the graph, as our crop is essentially free. It's designed that way to allow its use in larger-than-memory work.
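The lazy-versus-materialized distinction can be illustrated with plain Base arrays; a view plays the role of a lazy crop here, and copy plays the role of the proposed materialisation step (an analogy, not Rasters.jl API):

```julia
A = ones(5, 5)
v = @view A[2:4, 2:4]  # lazy window into A: no data is copied
c = copy(v)            # materialized: owns its own memory

A[2, 2] = 0.0          # mutate the parent array
# v[1, 1] is now 0.0 -- the view sees the change
# c[1, 1] is still 1.0 -- the copy does not
```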
And yes, in Julia we have the Tables.jl standard rather than a specific built-in dataframe object. Anything Tables.jl-compatible is essentially equivalent to all other Tables.jl-compatible packages, so it is normal to let the conversion happen at user discretion (Rasters.jl does not depend on DataFrames.jl, only Tables.jl). The Vector of NamedTuple we use is the simplest and often fastest compatible table.
In real extract workflows we don't usually use DataFrames.jl, as that's only useful if you need to use some of its joins or filters. If you are going to write to disk it's just adding a redundant step. Running DataFrame(output) after these methods run is kind of benchmarking the DataFrames.jl package rather than Rasters.jl, but if you want to enforce DataFrame conversions, that's basically all we have to add.
But it may be missing the point that the ability to use trait systems like Tables.jl are a key performance advantage of Julia over other languages. We design everything in Rasters.jl from within that context, where R packages operate in a "dataframe is core object" paradigm.
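As a concrete sketch of why a Vector of NamedTuples already behaves like a table: column names are shared across rows, and columns can be pulled out with plain Base Julia, no DataFrames.jl needed (the values below are made up, not benchmark data):

```julia
rows = [
    (geometry = (668900.0, 5.69258e6), B1 = 0x2410, B2 = 0x20e5),
    (geometry = (600000.0, 5.70000e6), B1 = 0x2500, B2 = 0x2100),
]

# column access, roughly what Tables.jl-aware packages do under the hood
b1 = getproperty.(rows, :B1)     # Vector{UInt16} column
colnames = keys(first(rows))     # (:geometry, :B1, :B2)
```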
Thanks! If you have any fixes, please let me know and I will run it again.
I'm unsure about the output structure for the point extraction and zonal statistics tasks because, as you mentioned, the R ecosystem relies heavily on data frames. In the case of Python packages, I convert to a pandas data frame in the scripts, so I would like the result to look the same everywhere. Can you return a Tables.jl table that looks like a data frame in R and pandas?
We can return that, but it's possible the conversion to DataFrame will double the run time!
Rasters.extract is super optimised to the point that the cost of allocating the output memory dominates the operation. Possibly DataFrames.jl can do a no-copy wrapper around our table? I will have to explore that.
From what I can see, converting to data frame has no impact:
julia> @be zonal($(Statistics.mean), $rstack; of=$(buffer_df.geom), progress=false) seconds = 10
Benchmark: 76 samples with 1 evaluation
min 122.162 ms (307961 allocs: 109.233 MiB)
median 133.007 ms (307961 allocs: 109.233 MiB, 6.57% gc time)
mean 132.404 ms (307961 allocs: 109.233 MiB, 5.77% gc time)
max 136.647 ms (307961 allocs: 109.233 MiB, 8.52% gc time)
julia> @be DataFrame(zonal($(Statistics.mean), $rstack; of=$(buffer_df.geom), progress=false)) seconds = 10
Benchmark: 76 samples with 1 evaluation
min 122.473 ms (308113 allocs: 109.386 MiB)
median 133.389 ms (308113 allocs: 109.387 MiB, 6.77% gc time)
mean 132.730 ms (308113 allocs: 109.387 MiB, 5.84% gc time)
max 137.825 ms (308113 allocs: 109.387 MiB, 8.17% gc time)
But what concerns me is that the first execution takes about 5 seconds, and each subsequent one takes about 0.13 seconds.
Setting missingval=0 in write doesn't work for me:
julia> @be write($stack_file, $ras; missingval=0, force=$true, options = $(Dict("COMPRESS" => "LZW", "INTERLEAVE" => "BAND"))) seconds=60
ERROR: GDALError (CE_Failure, code 1):
GetNoDataValue() should be called instead
Oh, good to know! I guess this is for polygons, not lines or points, where each row has a relatively small extract cost.
But what concerns me is that the first execution takes about 5 seconds, and each subsequent one takes about 0.13 seconds.
Yes, this is just the Julia language. We can get better than C performance by compiling at runtime. But at the cost that anything small that you only run once may be slow because it takes a few seconds to compile.
If your idea is to do large-scale work where performance matters, then compile time is irrelevant; but if you are benchmarking simple scripts you would only run once, then yes, the first time has some effects. So it depends on what your objective is here. We care about large scales; for example, Rasters.extract was explicitly optimised to get an hours-to-minutes transition for terra users, as that's the sweet spot for people who would consider using Julia for spatial work.
And with missingval your 0 is an Int64, is the file Int64? GDAL handles Int64 in a different function, and Julia treats Int as different things to floats unlike R so it's more likely to cause problems. (Either way it should be a clearer error)
Ok, thanks! Maybe I should add information to the README that Julia functions are compiled and take longer to run the first time. Later, during the benchmark, from what I noticed, the first execution (compilation) time is always skipped?
And with missingval your 0 is an Int64, is the file Int64? GDAL handles Int64 in a different function, and Julia treats Int as different things to floats unlike R so it's more likely to cause problems. (Either way it should be a clearer error)
Input files are encoded as UInt16 with NoData set to 0 in the metadata.
Edit: I changed it to missingval=UInt16(0) and it works.
Yeah, that sounds right for missingval. Maybe we should add a check for mismatched types to avoid calling the GDAL function that errors.
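One way to avoid the mismatch without hard-coding UInt16 is to derive the missing value from the data's element type; a minimal sketch with a plain array standing in for the raster band:

```julia
ras_data = UInt16[10 20; 30 0]   # stand-in for a UInt16 band
mv = zero(eltype(ras_data))      # UInt16(0): same type as the data
# passing mv (rather than the Int64 literal 0) keeps GDAL on the UInt16 path
```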
And yes, that's a good note to have; it's quite common in Julia performance docs to flag that, because Julia can "feel slow" when you get compilation on the first demo. But you only have to pay that price again if you are using different types, as in, converting the raster to Float64 would trigger a bit more compilation.
Rasters.jl is aggressive about compiling the best possible kernel in most contexts, so this will happen quite a bit. But that's part of why it benchmarks as well as it does.
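The first-call compilation cost can be seen with any freshly defined function: the first call compiles, the second reuses the compiled code (timings vary by machine, so no concrete numbers are claimed here):

```julia
fresh_stat(xs) = sum(x^2 for x in xs)  # fresh function: not compiled yet

t1 = @elapsed fresh_stat(1:1_000)      # includes compilation time
t2 = @elapsed fresh_stat(1:1_000)      # compiled code only: much faster
```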
Have you tried using the minimum or maximum instead of the mean in the zonal statistics? I get an error:
zonal((Statistics.minimum), rstack; of=(buffer_df.geom), progress=false)
#> ERROR: TaskFailedException
#> nested task error: ArgumentError: reducing over an empty collection is not allowed; consider supplying `init` to the reducer
Stacktrace
[1] _empty_reduce_error()
@ Base ./reduce.jl:319
[2] reduce_empty(f::Function, T::Type)
@ Base ./reduce.jl:320
[3] mapreduce_empty(::typeof(identity), op::Function, T::Type)
@ Base ./reduce.jl:369
[4] _mapreduce(f::Function, op::Function, ::IndexLinear, itr::Base.SkipMissing{Raster{…}})
@ Base ./missing.jl:286
[5] mapreduce
@ ./missing.jl:273 [inlined]
[6] minimum
@ ./reduce.jl:793 [inlined]
[7] _maybe_skipmissing_call
@ ~/.julia/packages/Rasters/5ShjB/src/methods/zonal.jl:150 [inlined]
[8] #1037
@ ~/.julia/packages/Rasters/5ShjB/src/methods/zonal.jl:112 [inlined]
[9] macro expansion
@ ~/.julia/packages/DimensionalData/hv9KC/src/utils.jl:238 [inlined]
[10] unrolled_map(f::Rasters.var"#1037#1039"{Bool, typeof(minimum)}, v::NTuple{10, Raster{…}})
@ DimensionalData ~/.julia/packages/DimensionalData/hv9KC/src/utils.jl:238
[11] maplayers(f::Function, s::RasterStack{…})
@ DimensionalData ~/.julia/packages/DimensionalData/hv9KC/src/stack/stack.jl:232
[12] _zonal(f::Function, st::RasterStack{…}, ::PolygonTrait, geom::GeoInterface.Wrappers.Polygon{…}; skipmissing::Bool, kw::@Kwargs{})
@ Rasters ~/.julia/packages/Rasters/5ShjB/src/methods/zonal.jl:110
[13] _zonal
@ ~/.julia/packages/Rasters/5ShjB/src/methods/zonal.jl:104 [inlined]
[14] _zonal
@ ~/.julia/packages/Rasters/5ShjB/src/methods/zonal.jl:90 [inlined]
[15] (::Rasters.var"#1041#1042"{@Kwargs{…}, typeof(minimum), RasterStack{…}, Vector{…}, Vector{…}})(i::Int64)
@ Rasters ~/.julia/packages/Rasters/5ShjB/src/methods/zonal.jl:124
[16] macro expansion
@ ~/.julia/packages/Rasters/5ShjB/src/utils.jl:466 [inlined]
[17] (::Rasters.var"#328#threadsfor_fun#240"{Rasters.var"#328#threadsfor_fun#239#241"{…}})(tid::Int64; onethread::Bool)
@ Rasters ./threadingconstructs.jl:253
[18] #328#threadsfor_fun
@ ./threadingconstructs.jl:220 [inlined]
[19] (::Base.Threads.var"#1#2"{Rasters.var"#328#threadsfor_fun#240"{Rasters.var"#328#threadsfor_fun#239#241"{…}}, Int64})()
@ Base.Threads ./threadingconstructs.jl:154
That would mean that there was at least one geometry either completely outside the raster, or for which the masked raster was all NA/missing. Then you are taking minimum([]), which is not supported. But that doesn't explain why mean works... will have to try that out locally.
What is surprising is that the mean works, but the minimum/maximum doesn't. This problem may be partly related to the smaller extent of thermal bands (B10 and B11) compared to other bands. If all values are missing, then terra and exactextractr return NA for that layer, however if some values are in range, then the result is returned.
library(terra)
# thermal band (B10)
f = "data/LC08_L1TP_190024_20200418_20200822_02_T1/LC08_L1TP_190024_20200418_20200822_02_T1_B10.TIF"
r = rast(f)
buffers = vect("data/vector/buffers.gpkg")
r = classify(r, cbind(-Inf, Inf, 1))
extent = as.polygons(r, values = FALSE)
idx = which(!relate(buffers, extent, "within"))
length(idx)
#> 48 # buffers fully or partially outside the raster
library(exactextractr)
df = exact_extract(rstack, buffers, fun = "min")
df[idx, ]
#> B1 B10 B11 B2 B3 B4 B5 B6 B7 B9
#> 89 8974 24009 22677 8103 7285 6508 9770 7980 6496 5035
#> 137 9431 24440 22925 8536 7801 6950 14122 8879 6677 5031
#> 167 9034 NA NA 8156 7321 6696 10406 8483 7144 5053
#> 266 9017 NA NA 8158 7396 6558 6826 6023 5882 5048
#> 310 8891 23286 21973 8012 7097 6449 9435 7598 6246 5038
What you are seeing is just Base Julia behaviour!
The reducing function you use in zonal is actually arbitrary; we just call it on an iterator, with no special casing at all. (e.g. you can write your own function that checks: myminimum(vals) = isempty(vals) ? missing : minimum(vals))
We could add a keyword like emptyval to add an optional global check that the iterator has at least one value? Then you could do emptyval=missing and avoid this error.
(In base Julia/Statistics.jl mean can return NaN because it always involves a conversion to floating point anyway, and taking the mean of an empty list is 0/0 so NaN is a consistent answer.
But minimum and maximum need to return the original type of the array, and there is no default integer return value for reductions)
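The user-side workaround mentioned above, spelled out as a runnable sketch (safe_minimum is a hypothetical helper name, not a Rasters.jl function):

```julia
# return `missing` instead of erroring when the iterator is empty
safe_minimum(vals) = isempty(vals) ? missing : minimum(vals)

safe_minimum(Int[])                            # missing
safe_minimum([3, 1, 2])                        # 1
safe_minimum(skipmissing([missing, missing]))  # missing (empty after skipping)
```

This could then presumably be passed to zonal in place of Statistics.minimum.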
We could add a keyword like emptyval to add an optional global check that the iterator has at least one value? Then you could do emptyval=missing and avoid this error.
Yes, good idea, because there are only 48 missing values (out of almost 2,000 objects) for two layers, so the user will still get the results and it will be consistent with the behavior of the other packages.
PR is up https://github.com/rafaqz/Rasters.jl/pull/1023
Ok, for the first problem with different extents... hear me out... R and Python give you the wrong answer and Rasters.jl is correct!
As I suspected, your TIF files are points rather than areas, as in, the GDAL metadata for "AREA_OR_POINT" is "Point".
My understanding is that GDAL and most packages built on it ignore the point/interval distinction and return extents as if all rasters are areas (with the point in the top-left corner). Except gdalwarp does something different with them! So it's a bit of a mess.
These decisions are also quite contentious among OSGeo developers: considered wrong by some (https://github.com/OSGeo/gdal/issues/2384), just a design decision by others, and others consider that points should always be half-shifted to be areas so they are in the middle: https://github.com/opengeospatial/ogcapi-coverages/issues/92
But you will notice your extent is not half-shifted, so GDAL/terra etc. are really treating that corner point as just the corner of an area, not a point at all.
Rasters.jl builds on the more generally used DimensionalData.jl, which chose not to extend GDAL's arbitrary design decisions into all other fields of science. DimensionalData.jl treats points and areas differently, so that the extent for points will match that of an identical MultiPoint geometry holding the same points, and for areas it will match that of the pixels as a tiled MultiPolygon geometry. To me that's more principled.
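The two extents in the original report differ by exactly one 30 m cell, which is what the points-versus-intervals interpretation predicts. A sketch using the X coordinates from the gdalinfo output above:

```julia
res = 30.0
starts = 550785.0:res:783885.0   # cell start coordinates of the X axis

# points interpretation (Rasters.jl here): extent ends at the last coordinate
points_xmax = last(starts)        # 783885.0
# area interpretation (GDAL/terra): extent covers the whole last 30 m cell
areas_xmax = last(starts) + res   # 783915.0
```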
But in the end it's probably another design decision, and I will pay for it for a long time because this comparison will keep coming up!
I haven't checked yet, but this will also give you different zonal statistics, because if you consider a point an area, it is the center of that area, not the edge of it. I will change the raster loading code to force treating this as areas with the point in the corner, and we should get more similar answers.
So this is a quite widely disputed issue:
https://github.com/corteva/rioxarray/issues/805 https://github.com/GlacioHack/geoutils/issues/503
There is a GDAL RFC to shift the extent by half a pixel (which also claims GDAL has always been wrong about points): https://gdal.org/en/stable/development/rfc/rfc33_gtiff_pixelispoint.html
(It's not clear to me where or how widely this was implemented, but from your extent not being half-shifted, maybe it wasn't? Or terra doesn't use it? I don't know)
~~This would make the current output of terra zonal statistics wrong, and would likely lead to the answer that Rasters.jl already gives.~~ Unlikely it changes much at this resolution
~~The difference with crop is the combination of forcing treatment as pixel area, and setting touches=true.~~
~~We should probably make touches=true the default to standardise with R/Python~~
No, it's still different somehow; will dig in further
The main difference in aggregate is that the type is UInt16 and we are getting overflow in the sum.
If we force Float64, the answer is the same as terra:
julia> ag[1000, 1000, 1]
10125.444444444445
For mean specifically this is a bug, as the result is Float64. But for sum it's a harder question... normally sum doesn't change the type, so doing that here would break with the rest of the language. The idea would be to convert to Float64 first.
But I will fix mean.
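The wraparound can be seen directly with UInt16 scalars; this is Base Julia semantics, nothing Rasters-specific:

```julia
a, b = UInt16(60000), UInt16(10000)

wrapped = a + b            # wraps modulo 65536: UInt16(4464), not 70000
widened = Int(a) + Int(b)  # widen first: 70000

# subtraction underflows the same way, which is what bites NDVI (nir - red)
under = UInt16(5) - UInt16(10)  # wraps to UInt16(65531)
```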
Same for ndvi, it's just overflow or underflow when adding UInt16 in the benchmark implementation. Converting to Int64 on the fly (with multiplication by 1) fixes it:
julia> get_ndvi(red, nir) = (1 * nir .- red) ./ (1 * nir .+ red)
get_ndvi (generic function with 1 method)
julia> mean(skipmissing(get_ndvi(red, nir)))
0.2858150154175652
Thanks @rafaqz for looking into this in detail, great investigation! This is a topic for the next SDSL 😃
As I suspected, your TIF files are points rather than areas, as in, the GDAL metadata for "AREA_OR_POINT" is "Point".
I think the metadata is incorrect, because this is a satellite image from Landsat, so each sample covers an area of 30 x 30 meters (one entire cell, not a single point). I downloaded this data from the USGS over 5 years ago, so maybe the metadata is corrected today, but I didn't check. GDAL 3.11 displays the following extent:
gdalinfo LC08_L1TP_190024_20200418_20200822_02_T1_B1.TIF
#> Corner Coordinates:
#> Upper Left ( 550785.000, 5847315.000) ( 15d45'10.10"E, 52d46'23.78"N)
#> Lower Left ( 550785.000, 5611185.000) ( 15d43' 6.11"E, 50d39' 1.10"N)
#> Upper Right ( 783915.000, 5847315.000) ( 19d12' 9.41"E, 52d42' 4.10"N)
#> Lower Right ( 783915.000, 5611185.000) ( 19d 0'39.44"E, 50d35' 0.33"N)
#> Center ( 667350.000, 5729250.000) ( 17d25'16.13"E, 51d41'21.24"N)
It's great that you found solutions for ndvi and aggregate too. Maybe in the future you should consider automatically converting to float to avoid this kind of error (or adding a warning)? It seems to me that the average user may not catch this and will think the output is correct.
I'm bugfixing aggregate mean and sum, as they should convert internally; I just implemented them badly.
But NDVI is just a base Julia broadcast, not really Rasters.jl in any way, and in that case you just have to know... Julia doesn't hold your hand with number types, because that makes everything slower. And converting the Raster eagerly to float would take 4-8x the memory (4x for UInt16, 8x for UInt8), so it's not something we would do by default.
(Some cultural differences here!)