aggregate could be heaps faster if polygon is small relative to raster
https://github.com/r-spatial/stars/files/3020397/PRISM_tmax_stable_4kmD1_19880611_bil.zip
The benchmark below compares aggregate(stars_object, sf_polygon, mean) against the same call where the stars_object is first pre-shrunk by subsetting it with [ using sf_polygon buffered by enough to guarantee that all overlapping grid cells are included.
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(tidyverse)
library(stars)
#> Loading required package: abind
x1 = read_stars("D:/prism/PRISM_tmax_stable_4kmD1_19880611_bil/PRISM_tmax_stable_4kmD1_19880611_bil.bil")
nc = read_sf(system.file("shape/nc.shp", package="sf"))
nc = nc %>% st_transform(st_crs(x1)) %>% filter(CNTY_ID == 2099)
nc$geometry[[1]][[1]] <- NULL
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
bench::mark(
# Normal one is slow
aggregate(x1, nc, mean),
# Fast one pre-shrinks the raster using a buffer around the polygon
{
x1small <- x1[nc %>% st_buffer(max(st_dimensions(x1)[[1]]$delta, st_dimensions(x1)[[2]]$delta)*2)]
aggregate(x1small, nc, mean)
}
)
#> # A tibble: 2 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc n_itr
#> <chr> <bch:> <bch:t> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int>
#> 1 aggregate~ 6.49s 6.49s 6.49s 6.49s 0.154 165MB 16 1
#> 2 {... 13.3ms 15.02ms 14.28ms 19.21ms 66.6 328KB 2 34
#> # ... with 1 more variable: total_time <bch:tm>
Created on 2019-03-29 by the reprex package (v0.2.1.9000)
Could aggregate itself be modified to use this strategy to speed things up?
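As a user-level sketch of the idea (aggregate_cropped and the two-cell padding are my own choices for illustration, not an existing stars API), the cropping step from the reprex above could be wrapped like this:

aggregate_cropped <- function(x, by, FUN, ...) {
  d <- st_dimensions(x)
  # pad by two cells so every grid cell overlapping `by` survives the crop
  pad <- 2 * max(abs(d[[1]]$delta), abs(d[[2]]$delta))
  aggregate(x[st_buffer(by, pad)], by, FUN, ...)
}
aggregate_cropped(x1, nc, mean)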
I committed changes that I think do something similar, but for a stars_proxy object; could you run the benchmark again, reading x1 with proxy = TRUE?
The operation doesn't seem to work with proxy = TRUE; it only returns NAs:
library(sf)
#> Warning: package 'sf' was built under R version 3.5.3
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.5.2
library(stars)
#> Loading required package: abind
#> Warning: package 'abind' was built under R version 3.5.2
x1_proxy = read_stars("PRISM_tmax_stable_4kmD1_19880611_bil/PRISM_tmax_stable_4kmD1_19880611_bil.bil", proxy = TRUE)
nc = read_sf(system.file("shape/nc.shp", package="sf"))
nc = nc %>% st_transform(st_crs(x1_proxy)) %>% filter(CNTY_ID == 2099)
nc$geometry[[1]][[1]] <- NULL
aggregate(x1_proxy, nc, mean)
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
#> Warning in mean.default(newX[, i], ...): argument is not numeric or
#> logical: returning NA
#> stars object with 1 dimensions and 1 attribute
#> attribute(s):
#> PRISM_tmax_stable_4kmD1_19880611_bil.bil
#> Min. : NA
#> 1st Qu.: NA
#> Median : NA
#> Mean :NaN
#> 3rd Qu.: NA
#> Max. : NA
#> NA's :1
#> dimension(s):
#> from to offset delta refsys point
#> geometry 1 1 NA NA +proj=longlat +ellps=GRS8... FALSE
#> values
#> geometry MULTIPOLYGON (((-76.0166 35...
Created on 2019-04-01 by the reprex package (v0.2.0.9000).
Did you update stars? I see
> aggregate(x1_proxy, nc, mean)
although coordinates are longitude/latitude, st_intersects assumes that they are planar
although coordinates are longitude/latitude, st_intersects assumes that they are planar
stars object with 1 dimensions and 1 attribute
attribute(s):
PRISM_tmax_stable_4kmD1_19880611_bil.bil
Min. :18.91
1st Qu.:18.91
Median :18.91
Mean :18.91
3rd Qu.:18.91
Max. :18.91
dimension(s):
from to offset delta refsys point
geometry 1 1 NA NA +proj=longlat +ellps=GRS8... FALSE
values
geometry MULTIPOLYGON (((-76.0166 35...
Warning message:
In c.stars(list(PRISM_tmax_stable_4kmD1_19880611_bil.bil = 18.9099998474121), :
along argument ignored; maybe you wanted to use st_redimension?
library(sf)
library(tidyverse)
library(stars)
#> Loading required package: abind
x1 = read_stars("PRISM_tmax_stable_4kmD1_19880611_bil/PRISM_tmax_stable_4kmD1_19880611_bil.bil")
nc = read_sf(system.file("shape/nc.shp", package="sf"))
nc = nc %>% st_transform(st_crs(x1)) %>% filter(CNTY_ID == 2099)
nc$geometry[[1]][[1]] <- NULL
x1_proxy = read_stars("PRISM_tmax_stable_4kmD1_19880611_bil/PRISM_tmax_stable_4kmD1_19880611_bil.bil", proxy = TRUE)
bench::mark(
# Normal one is slow
aggregate(x1, nc, mean),
# Fast one pre-shrinks the raster using a buffer around the polygon
{
x1small <- x1[nc %>% st_buffer(max(st_dimensions(x1)[[1]]$delta, st_dimensions(x1)[[2]]$delta)*2)]
aggregate(x1small, nc, mean)
},
# Fast
aggregate(x1_proxy, nc, mean),
check = FALSE
)
#> # A tibble: 3 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc
#> <chr> <bch:t> <bch:t> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 aggregate~ 9.15s 9.15s 9.15s 9.15s 0.109 189MB 17
#> 2 {... 11.83ms 13.99ms 13.26ms 23.68ms 71.5 751KB 1
#> 3 aggregate~ 11.17ms 12.7ms 12.25ms 16.45ms 78.7 426KB 1
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>
Created on 2019-04-01 by the reprex package (v0.2.1.9000)
Very fast. Note, however, that the results in 3 are not equal to the first two according to all.equal.
The difference in memory footprint is also interesting.
Of course the gain with this approach is only large if by is geometrically small compared to x. I agree that something should be done with the aggregate.stars method too, though.
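For reference, a minimal way to reproduce that comparison outside bench::mark, using the object names from the reprex above (comparing the whole stars objects is just one way to do it):

res_full  <- aggregate(x1, nc, mean)
res_proxy <- aggregate(x1_proxy, nc, mean)
all.equal(res_full, res_proxy)  # reported as not TRUE in the note above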
I'm refreshing this thread because I ran some benchmarks comparing stars, terra and exactextractr, and stars turned out to be the slowest. I think this issue is worth exploring further. In stars, I read the raster data as a proxy. Here are the timings:
# raster dim: [7871, 7771, 10]
# 2000 polygons
# aggregate by mean
exactextractr: 15 s
terra: 39 s
stars: 398 s
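Roughly, the calls being compared look like the following (file and object names are placeholders, not the exact benchmark scripts):

library(sf); library(stars); library(terra); library(exactextractr)
polys <- read_sf("polygons.gpkg")               # placeholder vector data
r <- rast("raster.tif")                         # placeholder multi-band raster
s <- read_stars("raster.tif", proxy = TRUE)     # same raster, stars proxy

exact_extract(r, polys, "mean")                 # exactextractr
terra::extract(r, vect(polys), fun = mean)      # terra
aggregate(s, polys, mean)                       # stars (proxy)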
Did you also try this on non-proxy data? Your benchmark would be most useful if it were reproducible, like the one from the OP.
I tried but encountered memory management problems.
# raster dim: [7871, 7771, 2]
# 100 polygons
# aggregate by mean
exactextractr: 0.93 s
terra: 1.02 s
stars
system.time(aggregate(ras, buffers, FUN = mean))
#> Error: memory exhausted (limit reached?)
#> Error during wrapup: memory exhausted (limit reached?)
#> Error: no more error handlers available (recursive errors?); invoking 'abort' restart
#> Timing stopped at: 31.61 4.36 37.25
format(object.size(ras), units = "auto")
#> "933.3 Mb" # I still had ~4 GB of free RAM.
Edit: I will make all the data and scripts available in the near future.
I created a repository with data and some benchmarking scripts:
- Code: https://github.com/kadyb/raster-benchmark/blob/main/stars/zonal.R
- Raster data: https://drive.google.com/uc?id=1lzglfQJqlQh9OWT_-czc5L0hQ1AhoR8M&export=download
- Vector data: https://github.com/kadyb/raster-benchmark/blob/main/data/vector/buffers.gpkg
As I mentioned in this thread, I used aggregation by polygons with the raster loaded as a proxy, because the function didn't work when all the data was loaded into memory. Next week I will test it on a better PC.
You can see the first version of the benchmark results (I also included Python packages) there: https://kadyb.github.io/raster-benchmark/report.html
Thanks, looks great, will try to run them!
I plan to develop it further. I will add a bash script to automatically run all tests and produce a report when new package versions are released. If you have any comments or suggestions, please let me know. I think this can be really useful and interesting for the community.
Did you also try this on non-proxy data?
Here are the results:
# raster dim: [7871, 7771, 10]
# 2000 polygons
# aggregate by mean
# just one run
# timings in seconds
stars (proxy = FALSE): 1213
stars (proxy = TRUE): 264
exactextractr: 14
terra: 27
stars (proxy = FALSE) uses a lot of RAM. After 7 min it was at 40 GB, after 10 min at 50 GB, and the peak was 58 GB. The second iteration failed because it ran out of RAM (64 GB) and the R process was killed by the system.
It's worth pointing out that stars (proxy = TRUE) is much faster than stars (proxy = FALSE) and only takes up 2 GB of RAM.