raster-benchmark icon indicating copy to clipboard operation
raster-benchmark copied to clipboard

[WIP] add Rasters.jl benchmarks

Open asinghvi17 opened this issue 1 year ago • 16 comments

This PR adds benchmarks for Rasters.jl, and adds a pixi environment file to manage the installations of R, Python, Julia, and packages.

I added the pixi file because there's a version conflict if you try to install exactextractr and stars in the same environment, they require different versions of gdal. So now each package has a separate environment.

The way to run this is still that you just run run-benchmarks.

download-3

asinghvi17 avatar Sep 23 '24 20:09 asinghvi17

should I add https://github.com/kadyb/raster-benchmark/issues/15 too while I'm at it @kadyb?

asinghvi17 avatar Sep 24 '24 21:09 asinghvi17

Edited the first post now that the PR is functionally complete from the Rasters.jl end.

asinghvi17 avatar Sep 24 '24 22:09 asinghvi17

should I add #15 too while I'm at it @kadyb?

@asinghvi17, let's wait with this, because it uses a different dataset. Separate issue is that depending on the dataset, {fasterize} can be significantly slower than GDAL.

I have another question -- In task "zonal" are exact zone statistics like in {exactextractr} used or approximate like in the other packages?

kadyb avatar Sep 25 '24 10:09 kadyb

Its approximate like the other packages. Now I've seen exactextractr is exact I'll add that option in a few months!

But I think it will be even faster after that in most cases because it would be best with switching to an online stats approach (at least where that is possible for sum/prod/mean etc).

Currently zonal is Rasters.jl is just a nice but very basic shortcut for applying a function to the result of mask and crop over each geometry - its the rasterization machinery under mask that is fast.

So for #15 it would be really nice to have mask and rasterize benchmarks here too from my perspective!

(It would also be good for rasterize to have a range of datasets with different kinds of geometries with varying node densities and target raster resolutions to get a clear picture of the tradroffs. I also think in some cases fasterize will be close to Rasters.jl and others slower.

Another thing is there are things Rasters.jl and gdal can do that fasterize can't do, and actually a lot Rasters can do that gdal can't do either - like arbitrary functions (even functions like median that need to sort) and custom objects/number types. it would be good to cover some of those - at least things gdal can do that fasterize cant)

rafaqz avatar Sep 25 '24 10:09 rafaqz

FWIW: Ideally, it would be useful to compare the performance of the current GDAL algorithm and scan line algorithm at the C++ level, since what I presented #15 is quite limited. If the latter algorithm turned out to be significantly better, it would be worth implementing it directly in GDAL so that all packages could benefit from it. And as you mentioned, {fasterize} has some limitations, e.g. it only works with {raster} objects and polygons, and fewer options compared to GDAL. Here is related issue in the GDAL repository: https://github.com/OSGeo/gdal/issues/7200

CC: @mdsumner because is also involved in this topic.

kadyb avatar Sep 25 '24 12:09 kadyb

I was working on fasterize today ...

Btw, see this related effort here, and discussion in six hours from now:

https://github.com/developmentseed/warp-resample-profiling

https://discourse.pangeo.io/t/pangeo-showcase-geospatial-reprojection-in-python-2024-whats-available-and-whats-next/4531

mdsumner avatar Sep 25 '24 13:09 mdsumner

Well, Rasters wont benefit from faster gdal rasterize as we only use gdal for i/o and for gdalwarp. But it would be good to have detailed comparisons of these algorithms. I put a bunch of work into optimising the scanline in Rasters but there will be places it will be slower than fasterize - a few nice diagrams of the performance space would really help understand where things are at.

rafaqz avatar Sep 25 '24 13:09 rafaqz

I almost forgot about this, but for exact zonal statistics we have GO.coverage that is an efficient way to get area of a rectangle that a polygon covers. I think the rectangle there has to be axis aligned, though, which may present a problem for affine spaces or matrix lookups.

asinghvi17 avatar Sep 26 '24 16:09 asinghvi17

I imagined we can do that in the line burning phase and get coverage for each pixel the line touches instead of burning.

But zonal is doing a lot very fast, so it needs to be very much tuned to purpose to not end up with an order of magnitude or 2 slowdown.

We also have subsampling coverage in Rasters

rafaqz avatar Sep 26 '24 16:09 rafaqz

I noticed there is mistake in the title. Instead of "Benchmark vector operations" it should be "Benchmark raster operations".

Also here: https://github.com/rafaqz/Rasters.jl/blob/main/README.md#performance

kadyb avatar Feb 05 '25 12:02 kadyb

Yeah I think @asinghvi17 has reused the Makie.jl code from the vector benchmarks

rafaqz avatar Feb 05 '25 14:02 rafaqz

@asinghvi17 might also be nice to call it "Rasters.jl" in the label rather than "rasters_jl" ? People I have shown were confused by that

rafaqz avatar Feb 05 '25 14:02 rafaqz

New benchmark image

download-3

asinghvi17 avatar Feb 12 '25 23:02 asinghvi17

Platform Info: OS: Linux (x86_64-linux-gnu) CPU: 96 × Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz RAM: 16 GB

asinghvi17 avatar Feb 13 '25 00:02 asinghvi17

Kinda weird that the other timings got worse, I'm not sure what changed if anything (nvdi especially, that's just a broadcast)

And aggregate was far faster than resample but maybe we were benchmarking nearest rather than mean, so some work to do on mean

rafaqz avatar Feb 13 '25 11:02 rafaqz

Aggregate should still be way faster. Are we sure this is on the latest Rasters release?

tiemvanderdeure avatar Apr 22 '25 12:04 tiemvanderdeure

@kadyb Are you interested in this PR, now that you also got pixi up and running?

evetion avatar Sep 17 '25 07:09 evetion

Yeah its not a WIP anymore either @asinghvi17

Julia has been at the top of nearly all the vector and raster benchmarks for a year now and nothing is public.

Should we make PRs to update the readme too?

rafaqz avatar Sep 17 '25 12:09 rafaqz

We should probably also star and explain the crop time because we had to slow it down for Julia... it only take a few nanoseconds so it breaks the scale of the graph

rafaqz avatar Sep 17 '25 12:09 rafaqz

Yeah that sounds good...the plotting should also work but I need to update some stuff in dependent libs to get the latest versions

asinghvi17 avatar Sep 17 '25 14:09 asinghvi17

Gentlemen, it's time to merge this! Thank you very much for your contributions!

What are the next steps: First, I will run vector benchmarks. Then, raster benchmarks, but before that, we need to make sure that we are comparing exactly the same things in the same way and the results are identical.

kadyb avatar Sep 17 '25 19:09 kadyb

We should probably also star and explain the crop time because we had to slow it down for Julia... it only take a few nanoseconds so it breaks the scale of the graph

In Rasters.jl, is a new object created after cropping or is it some kind of in-place operation like terra::window()? I'm referring in particular to this description:

This is similar to crop without actually creating a new dataset.

kadyb avatar Sep 17 '25 19:09 kadyb

It sounds to me more like Terra.window. if you change a value in the output of crop it will also change in the source array.

If the point is to test read time there are other ways (just ras[X(1..2), Y(1..2)])

asinghvi17 avatar Sep 17 '25 19:09 asinghvi17

So it sounds like two different operations, which is why there are differences in performance. I believe that in this case, to ensure an identical comparison, in terra we should use in-place window setting: https://rspatial.github.io/terra/reference/inplace.html?

The purpose of this test is to check how fast a multiband raster is cropped to bounding box, but by default packages create a new raster as the output, which is why this operation is slower than rasters.jl?

kadyb avatar Sep 17 '25 20:09 kadyb

Yeah, that sounds like a good plan. In place is probably the thing you want to do here as a user.

asinghvi17 avatar Sep 17 '25 20:09 asinghvi17

I see two issues here:

  1. In-place operations are not supported by some packages, such as stars, raster and rasterio.
  2. I think that for average users, in-place operations are dangerous. Modifying part of the data also modifies the original data, which may go unnoticed by the user and affect the analysis results. Perhaps it would be more fair not to test in-place modifications?

kadyb avatar Sep 17 '25 20:09 kadyb