
aggregate could be heaps faster if polygon is small relative to raster

Open · kendonB opened this issue on Mar 28, 2019 · 12 comments

https://github.com/r-spatial/stars/files/3020397/PRISM_tmax_stable_4kmD1_19880611_bil.zip

The code below benchmarks aggregate(stars_object, sf_polygon, mean) against the same call where the stars_object is first pre-shrunk with [, using sf_polygon buffered by a distance large enough to guarantee that all overlapping grid cells are included.

library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(tidyverse)
library(stars)
#> Loading required package: abind
x1 = read_stars("D:/prism/PRISM_tmax_stable_4kmD1_19880611_bil/PRISM_tmax_stable_4kmD1_19880611_bil.bil")
nc = read_sf(system.file("shape/nc.shp", package="sf"))

nc = nc %>% st_transform(st_crs(x1)) %>% filter(CNTY_ID == 2099)

nc$geometry[[1]][[1]] <- NULL

#> although coordinates are longitude/latitude, st_intersects assumes that they are planar

bench::mark(
  # Normal one is slow
  aggregate(x1, nc, mean),
  # Fast one pre-shrinks the raster using a buffer around the polygon
  {
    x1small <- x1[nc %>% st_buffer(max(st_dimensions(x1)[[1]]$delta, st_dimensions(x1)[[2]]$delta)*2)]
    aggregate(x1small, nc, mean)
  }
)
#> # A tibble: 2 x 10
#>   expression    min    mean  median     max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>
#> 1 aggregate~  6.49s   6.49s   6.49s   6.49s     0.154     165MB    16     1
#> 2 {...       13.3ms 15.02ms 14.28ms 19.21ms    66.6       328KB     2    34
#> # ... with 1 more variable: total_time <bch:tm>

Created on 2019-03-29 by the reprex package (v0.2.1.9000)

Could aggregate itself be modified to use this strategy to speed things up?
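
For illustration, the pre-shrinking strategy could be folded into a wrapper along these lines (a minimal sketch, not part of stars; aggregate_cropped() is a hypothetical helper and assumes a regular 2D raster):

aggregate_cropped <- function(x, by, FUN, ...) {
  d <- st_dimensions(x)
  # buffer by twice the largest cell size so every overlapping cell is kept
  cell <- max(abs(d[[1]]$delta), abs(d[[2]]$delta))
  x_small <- x[st_buffer(by, 2 * cell)]
  aggregate(x_small, by, FUN, ...)
}

# usage, with x1 and nc from the reprex above:
# aggregate_cropped(x1, nc, mean)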

kendonB · Mar 28, 2019

I committed changes that I think do something similar, but for a stars_proxy object; could you run the benchmark again, reading x1 with proxy = TRUE?

edzer · Mar 29, 2019

The operation doesn't seem to work with proxy = TRUE. It only pulls out NAs:

library(sf)
#> Warning: package 'sf' was built under R version 3.5.3
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.5.2
library(stars)
#> Loading required package: abind
#> Warning: package 'abind' was built under R version 3.5.2
x1_proxy = read_stars("PRISM_tmax_stable_4kmD1_19880611_bil/PRISM_tmax_stable_4kmD1_19880611_bil.bil", proxy = TRUE)
nc = read_sf(system.file("shape/nc.shp", package="sf"))

nc = nc %>% st_transform(st_crs(x1_proxy)) %>% filter(CNTY_ID == 2099)

nc$geometry[[1]][[1]] <- NULL

aggregate(x1_proxy, nc, mean)
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
#> Warning in mean.default(newX[, i], ...): argument is not numeric or
#> logical: returning NA
#> stars object with 1 dimensions and 1 attribute
#> attribute(s):
#>  PRISM_tmax_stable_4kmD1_19880611_bil.bil 
#>  Min.   : NA                              
#>  1st Qu.: NA                              
#>  Median : NA                              
#>  Mean   :NaN                              
#>  3rd Qu.: NA                              
#>  Max.   : NA                              
#>  NA's   :1                                
#> dimension(s):
#>          from to offset delta                       refsys point
#> geometry    1  1     NA    NA +proj=longlat +ellps=GRS8... FALSE
#>                                  values
#> geometry MULTIPOLYGON (((-76.0166 35...

Created on 2019-04-01 by the reprex package (v0.2.0.9000).

kendonB · Mar 31, 2019

Did you update stars? I see

> aggregate(x1_proxy, nc, mean)
although coordinates are longitude/latitude, st_intersects assumes that they are planar
although coordinates are longitude/latitude, st_intersects assumes that they are planar
stars object with 1 dimensions and 1 attribute
attribute(s):
 PRISM_tmax_stable_4kmD1_19880611_bil.bil 
 Min.   :18.91                            
 1st Qu.:18.91                            
 Median :18.91                            
 Mean   :18.91                            
 3rd Qu.:18.91                            
 Max.   :18.91                            
dimension(s):
         from to offset delta                       refsys point
geometry    1  1     NA    NA +proj=longlat +ellps=GRS8... FALSE
                                 values
geometry MULTIPOLYGON (((-76.0166 35...
Warning message:
In c.stars(list(PRISM_tmax_stable_4kmD1_19880611_bil.bil = 18.9099998474121),  :
  along argument ignored; maybe you wanted to use st_redimension?

edzer · Mar 31, 2019

library(sf)
library(tidyverse)
library(stars)
#> Loading required package: abind
x1 = read_stars("PRISM_tmax_stable_4kmD1_19880611_bil/PRISM_tmax_stable_4kmD1_19880611_bil.bil")
nc = read_sf(system.file("shape/nc.shp", package="sf"))

nc = nc %>% st_transform(st_crs(x1)) %>% filter(CNTY_ID == 2099)

nc$geometry[[1]][[1]] <- NULL
x1_proxy = read_stars("PRISM_tmax_stable_4kmD1_19880611_bil/PRISM_tmax_stable_4kmD1_19880611_bil.bil", proxy = TRUE)

bench::mark(
  # Normal one is slow
  aggregate(x1, nc, mean),
  # Fast one pre-shrinks the raster using a buffer around the polygon
  {
    x1small <- x1[nc %>% st_buffer(max(st_dimensions(x1)[[1]]$delta, st_dimensions(x1)[[2]]$delta)*2)]
    aggregate(x1small, nc, mean)
  },
  # Fast
  aggregate(x1_proxy, nc, mean),
  check = FALSE
)
#> # A tibble: 3 x 10
#>   expression     min    mean  median     max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl>
#> 1 aggregate~   9.15s   9.15s   9.15s   9.15s     0.109     189MB    17
#> 2 {...       11.83ms 13.99ms 13.26ms 23.68ms    71.5       751KB     1
#> 3 aggregate~ 11.17ms  12.7ms 12.25ms 16.45ms    78.7       426KB     1
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>

Created on 2019-04-01 by the reprex package (v0.2.1.9000)

Very fast. Note, however, that the results in 3 are not equal to the first two according to all.equal.
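
For reference, the check could look like this (a sketch; it assumes the objects from the benchmark above are still in the session):

a_full  <- aggregate(x1, nc, mean)
a_proxy <- aggregate(x1_proxy, nc, mean)
# compare the aggregated attribute values of the two results
all.equal(a_full[[1]], a_proxy[[1]])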

kendonB · Mar 31, 2019

The difference in memory footprint is also interesting.

Of course, the gain from this approach is only large if by is geometrically small compared to x. I agree, though, that something should be done with the aggregate.stars method too.

edzer · Apr 1, 2019

I'm refreshing this thread because I ran some benchmarks comparing stars, terra and exactextractr, and stars turned out to be the slowest. I think this issue is worth exploring further. In stars, I read the raster data as a proxy. Here are the timings:

# raster dim: [7871, 7771, 10]
# 2000 polygons
# aggregate by mean

exactextractr: 15 s
terra: 39 s
stars: 398 s
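
For context, the zonal-mean calls being timed are presumably along these lines (a sketch with hypothetical file names raster.tif and polygons.gpkg; it assumes an exactextractr version that accepts a terra SpatRaster, and the actual scripts are linked later in the thread):

library(sf)
library(stars)
library(terra)
library(exactextractr)

polys <- read_sf("polygons.gpkg")

# stars, raster read as a proxy
s <- read_stars("raster.tif", proxy = TRUE)
res_stars <- aggregate(s, polys, FUN = mean)

# terra
r <- rast("raster.tif")
res_terra <- extract(r, vect(polys), fun = mean)

# exactextractr (weights cells by the fraction covered by each polygon)
res_ee <- exact_extract(r, polys, "mean")

The three packages treat boundary cells differently (exactextractr uses coverage-fraction weights), so the resulting means can differ slightly as well as the timings.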

kadyb · Feb 8, 2021

Did you also try this on non-proxy data? Your benchmark would be most useful if it was reproducible, like the one from the OP.

edzer · Feb 9, 2021

I tried but encountered memory management problems.

# raster dim: [7871, 7771, 2]
# 100 polygons
# aggregate by mean

exactextractr: 0.93 s
terra: 1.02 s
stars
system.time(aggregate(ras, buffers, FUN = mean))
#> Error: memory exhausted (limit reached?)
#> Error during wrapup: memory exhausted (limit reached?)
#> Error: no more error handlers available (recursive errors?); invoking 'abort' restart
#> Timing stopped at: 31.61 4.36 37.25

format(object.size(ras), units = "auto")
#> "933.3 Mb" # I still had ~4 GB of free RAM. 

Edit: I will make all the data and scripts available in the near future.

kadyb · Feb 9, 2021

I created a repository with data and some benchmarking scripts:

  1. Code: https://github.com/kadyb/raster-benchmark/blob/main/stars/zonal.R
  2. Raster data: https://drive.google.com/uc?id=1lzglfQJqlQh9OWT_-czc5L0hQ1AhoR8M&export=download
  3. Vector data: https://github.com/kadyb/raster-benchmark/blob/main/data/vector/buffers.gpkg

As I mentioned in this thread, I aggregated by polygons with the raster loaded as a proxy, because the function didn't work when all the data was loaded into memory. Next week I will test it on a better PC.

You can see the first version of the benchmark results (I also included Python packages) here: https://kadyb.github.io/raster-benchmark/report.html

kadyb · Feb 11, 2021

Thanks, looks great, will try to run them!

edzer · Feb 11, 2021

I plan to develop it further. I will add a bash script to automatically run all the tests and rebuild the report when new package versions are released. If you have any comments or suggestions, please let me know. I think this can be really useful and interesting for the community.

kadyb · Feb 11, 2021

Did you also try this on non-proxy data?

Here are the results:

# raster dim: [7871, 7771, 10]
# 2000 polygons
# aggregate by mean
# just one run

# timings in seconds
stars (proxy = FALSE):  1213
stars (proxy = TRUE):   264
exactextractr:          14
terra:                  27

stars (proxy = FALSE) uses a lot of RAM: after 7 min it was at 40 GB, after 10 min at 50 GB, and the peak was 58 GB. The second iteration failed because it ran out of RAM (64 GB) and the R process was killed by the system. It's worth pointing out that stars (proxy = TRUE) is much faster than stars (proxy = FALSE) and only uses about 2 GB of RAM.
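
As a side note, R-level allocations can be recorded alongside timing with bench::mark, as in the earlier posts (a sketch; ras and buffers are the objects from the linked script, and mem_alloc reports allocations made by R, not the OS-level peak seen in a system monitor):

bench::mark(
  aggregate(ras, buffers, FUN = mean),
  iterations = 1,
  check = FALSE
)
# the mem_alloc column gives the memory allocated by R during the call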

kadyb · Feb 15, 2021