terra icon indicating copy to clipboard operation
terra copied to clipboard

tapp much slower than raster::stackApply?

Open chrishaak opened this issue 2 years ago • 4 comments

I'm trying to take a spatraster (let's call it "r", which comprises 10013 daily observations, with the dimension/properties shown below, generated from a netcdf file) and use it to produce a spatraster of corresponding monthly means (let's call it "m");

class       : SpatRaster 
dimensions  : 241, 241, 10013  (nrow, ncol, nlyr)
resolution  : 0.08333333, 0.08333333  (x, y)
extent      : -78.04167, -57.95833, 31.95833, 52.04167  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 
source      : CMEMS-GLOBAL_001_030-thetao-1993_2020.nc 
varname     : thetao (Temperature) 
names       : theta~499_1, theta~499_2, theta~499_3, theta~499_4, theta~499_5, theta~499_6, ... 
unit        :   degrees_C,   degrees_C,   degrees_C,   degrees_C,   degrees_C,   degrees_C, ... 
time (days) : 1993-01-01 to 2020-05-31 

This seems like it should be a no-brainer using:

m <- terra::tapp(r, "yearmonths", fun = mean)

However, running the above code doesn't appear to complete in any reasonable amount of time.

In the past, using raster::stackapply (and a rasterbrick object for r, produced from the same ncdf file), I would make a vector of grouping indices for month_year and then use the code:

m <- stackApply(r, indices=month_year, fun=mean, na.rm=TRUE) 

And it would take just 59 seconds to complete.

Given the speed gains that I see almost everywhere else in terra (over raster), including tapp being faster than stackapply in other circumstances, I cannot figure out what is going on here? I have tried substituting layer indices (in place of "yearmonths"), but this makes no difference. Before I try to create a full example, and/or upload any files, I was hoping you might be able to tell me if this is unexpected behavior, or if I am doing something stupid?

update: For the sake of experimentation, I subsetted 100 layers from the spatraster "r", to see if that made a difference and then ran the same code:

rsmall <- r[[1:100]]
m <- terra::tapp(rsmall, "yearmonths", fun = mean)

This did in fact complete, taking 99 seconds. This implies that (assuming linear scaling) it would take roughly 10000 seconds, or approaching 3 hours, to complete a task that takes 59 seconds using raster::stackapply?

Running terra version 1.7-29 with RCPP 1.0.10 on in R 4.2.2 on ubuntu 20.04

Many thanks in advance for any thoughts you can offer here... C

chrishaak avatar May 01 '23 17:05 chrishaak

Thanks for bringing this to my attention. "raster" does not use GDAL to read NetCDF files (it uses the ncdf library directly, via the ncdf4 package). "terra" uses GDAL and this can be much slower when you have many time-steps. The GDAL data model is based on layers that are read one by one.

I had seen that initially, when I developed terra; but when I later did some benchmarking, including with the new GDAL multi-dim array interface, I did not find much of a difference, so I did not pursue it further. Also, I would only use format-specific code if there was a large benefit.

I probably use example files that somehow did not have this problem. I will try again, based on the information you provided; and see if I can reproduce this and look at a workaround via the multi-dim interface, or by directly using the NetCDF lib.

rhijmans avatar May 01 '23 20:05 rhijmans

Thanks for the quick reply; that makes sense. I imagine you can create what you need for testing pretty straightforwardly, but please do let me know If I can help by supplying any files (they are large though). Would one expect to see a similar "bottleneck" occur using terra::roll on the same (netcdf-sourced) data? Is there any way the issue be circumvented by somehow loading everything into RAM upfront (assuming RAM was not limiting?). Or is it not so simple?

Thanks again! c

chrishaak avatar May 01 '23 20:05 chrishaak

If RAM is not limiting you could create a RasterBrick, load that into memory, and then coerce it into a SpatRaster. If, instead, you would read the SpatRaster values into memory from the file, I would expect that step would take a long time.

rhijmans avatar May 01 '23 21:05 rhijmans

Based on your suggestion above, I read the netcdf in as a raster::brick, then used raster::readAll to force it into memory, and then coerced that into a spatraster. I then used terra::tapp on the spatraster as described above, and the whole thing ran in seconds! So it does seem pretty clear that the bottleneck is happening at the terra "read-in" stage (as you indicated).

The one hiccup here is this: when I coerce a rasterbrick (on disk) into a spatraster, the spatraster has a "time" slot (as it should!). But for some reason, when I coerce the same rasterbrick (from memory) into a spatraster, the "time" slot disappears, and I have to re-assign it afterward using terra:time (essentially I just copy it back to the object from the on-disk version of the rasterbrick). Any idea why "time" gets lost from the in-memory version?

chrishaak avatar May 01 '23 22:05 chrishaak