tmap icon indicating copy to clipboard operation
tmap copied to clipboard

complex **stars**: how to approach them

Open mtennekes opened this issue 3 years ago • 12 comments

Haven't much experience with terra, but so far, SpatRaster objects seem pretty straight forward. They always have three dimensions: nrow, ncol, and, nlyr.

stars objects are much more complex:

  • they support non-regular rasters (does terra also support them?),
  • they can have multiple attributes and multiple dimensions (other than x and y).

My question: how should we approach data variables of stars objects in tmap?

In case there are multiple attributes, these attributes are considered variable names (as was the case in the current version of tmap). Example:

library(tmap)
land
#> stars object with 2 dimensions and 4 attributes
#> attribute(s):
#>                cover                              cover_cls     
#>  Water bodies     :393060   Water                      :393060  
#>  Snow / Ice       : 61986   Snow/ice                   : 61986  
#>  Herbaceous       : 21377   Forest                     : 48851  
#>  Tree Open        : 16171   Other natural vegetation   : 32611  
#>  Sparse vegetation: 12247   Bare area/Sparse vegetation: 26904  
#>  Cropland         : 11658   Cropland                   : 17843  
#>  (Other)          : 66701   (Other)                    :  1945  
#>      trees          elevation     
#>  Min.   :  0.0    Min.   :-412    
#>  1st Qu.:  0.0    1st Qu.: 218    
#>  Median :  0.0    Median : 608    
#>  Mean   : 15.6    Mean   :1140    
#>  3rd Qu.: 19.0    3rd Qu.:1941    
#>  Max.   :100.0    Max.   :6410    
#>  NA's   :393060   NA's   :389580  
#> dimension(s):
#>   from   to offset.xmin delta.xmax                       refsys point values
#> x    1 1080        -180   0.333333 +proj=longlat +ellps=WGS8...  NULL   NULL
#> y    1  540          90  -0.333333 +proj=longlat +ellps=WGS8...  NULL   NULL
#>   x/y
#> x [x]
#> y [y]
tm_shape(land) + tm_raster(c("cover_cls", "elevation"))
#> Variable(s) "col" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Created on 2021-09-11 by the reprex package (v2.0.0)

Now, let's compare stars to terra with the landsat dataset:

library(tmap)
library(terra)
library(stars)
library(spDataLarge)
    
(landsat_stars = read_stars(system.file("raster/landsat.tif", package = "spDataLarge")))
#> stars object with 3 dimensions and 1 attribute
#> attribute(s), summary of first 1e+05 cells:
#>              Min. 1st Qu. Median     Mean 3rd Qu.  Max.
#> landsat.tif  7701    8412   8739 8941.313    9258 14072
#> dimension(s):
#>      from   to  offset delta                refsys point values x/y
#> x       1 1128  301905    30 WGS 84 / UTM zone 12N FALSE   NULL [x]
#> y       1 1428 4154085   -30 WGS 84 / UTM zone 12N FALSE   NULL [y]
#> band    1    4      NA    NA                    NA    NA   NULL
(landsat_terra = rast(system.file("raster/landsat.tif", package = "spDataLarge")))
#> class       : SpatRaster 
#> dimensions  : 1428, 1128, 4  (nrow, ncol, nlyr)
#> resolution  : 30, 30  (x, y)
#> extent      : 301905, 335745, 4111245, 4154085  (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=utm +zone=12 +datum=WGS84 +units=m +no_defs 
#> source      : landsat.tif 
#> names       : lan_1, lan_2, lan_3, lan_4 
#> min values  :  7550,  6404,  5678,  5252 
#> max values  : 19071, 22051, 25780, 31961

The difference is that the bands in terra are translated to layers, whereas in stars they are translated to a new dimension (called band). The layers of terra are more similar to the attributes in stars than to the dimensions. names(landsat_stars) gives "landsat.tif", whereas names(landsat_terra) gives "lan_1" to "lan_4". This makes perfectly sense. I can use split to split the dimension values into different attributes.

Selecting these 4 variables is straight-forward (as is with multiple attributes in a stars object):

tm_shape(landsat_terra) +
    tm_raster(c("lan_1", "lan_2", "lan_3", "lan_4"), col.free = FALSE)

tm_shape(landsat_terra) +
    tm_rgb(MV("lan_4", "lan_3", "lan_2"), col.scale = tm_scale_rgb(maxValue = 31961))

Created on 2021-09-11 by the reprex package (v2.0.0)

The question however is how to call the dimension values of a stars object. Two obstacles: 1) a dimension may have numeric values, and 2) there can be more than one non-x-y-dimension.

In the tmap3 it is possible to use integers. See https://r-tmap.github.io/tmap-book/layers.html#raster-layer (Figure 5.8). This would only work if there is only one attribute, and only one non-x-y dimension.

tm_shape(landsat_stars +
    tm_raster(1:4, col.free = FALSE)

# not working yet in tmap4

Currently, each dimension in a stars object is interpreted as a variable, in this example band which I can use to group by:

tm_shape(landsat_stars) +
    tm_raster("landsat.tif") +
    tm_facets("band")
#> stars object downsampled to 888 by 1125 cells.

Suppose there are multiple attributes and multiple non-x-y dimensions. The attributes can be seen as one dimension. So it could be possible to have something like this:

tm_shape(my_stars_object) +
    tm_rgb(MV(3, 2, 1)) +
    tm_facets_wrap(by = "ATTRIBUTES")

where MV stands for multivariate, the numbers refer to the values of the first non-x-y band, and "ATTRIBUTES" to the 'attributes' dimension. This example would make sense if my_stars_objects contains satellite images taken at several times.

Let me know what you think. Ideas and suggestions are welcome! @Nowosad @edzer @rsbivand @rhijmans

mtennekes avatar Sep 11 '21 18:09 mtennekes

In case of a four-dimensional stars object, it would make sense to associate dimension 3 with facet rows, and dimension 4 with faced columns, or vice versa. I think that is the maximum logic you can get (or dimension 5 aligns with "page", but who wants paged figures?) with two-dimensional plots & facets.

stars attributes are there for allowing different data types, e.g. the first attribute can be a numeric, the second a logical or categorical variable, while all share the same dimensionality; not sure if terra layers allow different types.

edzer avatar Sep 11 '21 21:09 edzer

Just a few comments from me:

  1. I do not think that terra supports non-regular rasters.
  2. Yes, terra accepts different types (e.g., numerical and categorical) in the same object - see a code example below.
  3. @mtennekes I am still thinking about MV... Two capital letter function name looks strange compared to the rest of tmap syntax. How about just tm_mv()?
  4. I agree that a special name, e.g., "ATTRIBUTES" would be useful.
library(terra)
#> terra version 1.4.2
elev = rast(nrows = 6, ncols = 6, resolution = 0.5, 
            xmin = -1.5, xmax = 1.5, ymin = -1.5, ymax = 1.5,
            vals = 1:36)

grain_order = c("clay", "silt", "sand")
grain_char = sample(grain_order, 36, replace = TRUE)
grain_fact = factor(grain_char, levels = grain_order)
grain = rast(nrows = 6, ncols = 6, resolution = 0.5, 
             xmin = -1.5, xmax = 1.5, ymin = -1.5, ymax = 1.5,
             vals = grain_fact)

mix_rast = c(elev, grain)
plot(mix_rast)

Created on 2021-09-12 by the reprex package (v2.0.1)

Nowosad avatar Sep 12 '21 08:09 Nowosad

In case of a four-dimensional stars object, it would make sense to associate dimension 3 with facet rows, and dimension 4 with faced columns, or vice versa. I think that is the maximum logic you can get (or dimension 5 aligns with "page", but who wants paged figures?) with two-dimensional plots & facets.

stars attributes are there for allowing different data types, e.g. the first attribute can be a numeric, the second a logical or categorical variable, while all share the same dimensionality; not sure if terra layers allow different types.

Indeed. The theoretical limit of a stars object that can be visualised in tmap is even 6 dimensions (see also https://github.com/r-tmap/tmap/issues/497#issuecomment-690421691). The 6th dimension is used as multivariate, e.g. rgb bands:

tm_shape(stars_obj) +
tm_rgb(tm_mv(1, 2, 3) +
tm_facets_wrap(row = "month", col = "time", page = "year")

The question is how should tmap detect to which dimension values 1, 2, and 3 belong. Probably, I'll go for the first non-x-y dimension that is not used by tm_facets.

Thx @Nowosad . Yes, tm_mv is better.

mtennekes avatar Sep 12 '21 10:09 mtennekes

My approach for processing terra objects and stars objects with raster is the following:

  1. Downsample (or aggregate) to a resolution that is sufficient for visualization (determined by the option max.raster).
  2. Store the data in a data.table, and save the raster specifications (crs, bbx, ncol/nrow).

Which resolution reduction method shall we use? Regular sampling or aggregation? In the current dev version:

  • I use st_downsample from stars. Friendly request for @edzer : could you please export it?
  • I use aggregate from terra which is very convenient and quick.

This works well, but is not consistent with each other. Therefore:

  • Is there an aggregation function in stars that works in the same way as terra's aggregate, so with an argument fact? (which can be a vector of aggregation factors per dimension, like n from st_downsample).
  • Is there a function in terra that is similar to stars:::st_downsample?

Ideally, the user should be able to specify whether to aggregate or to sample.

mtennekes avatar Sep 12 '21 10:09 mtennekes

Side note: for the aggregation method, the user should be aware (or ideally be able to change) the aggregation function. By default it is mean, which makes sense for most applications (e.g. satellite images). However, for other cases, the sum might be better (e.g. the number of people per raster cell).

For the sample method, the same holds: e.g. if the data value per raster cell is an absolute value (e.g. population count), then sampling is not a good method (unless population density is used).

mtennekes avatar Sep 12 '21 10:09 mtennekes

I use aggregate from terra which is very convenient and quick.

Have you tried on a 10000 x 10000 image, or larger, and compared to down sampling when the image contains overviews?

By default it is mean, which makes sense for most applications (e.g. satellite images)

Not sure I agree: the variance changes, what you show are no longer pixel values; I think there is a reason why the default for GDAL's resampling methods is near (nearest neighbor), besides speed. Ask a remote sensing specialist?

Exporting st_downsample - yes, that makes sense.

edzer avatar Sep 12 '21 12:09 edzer

Exporting st_downsample - yes, that makes sense.

still need to think this through; the more common pattern is to use st_as_stars() on a stars_proxy object and use the downscale argument.

edzer avatar Sep 12 '21 12:09 edzer

Agree that near is indeed the saver and faster choice.

Exporting st_downsample - yes, that makes sense.

still need to think this through; the more common pattern is to use st_as_stars() on a stars_proxy object and use the downscale argument.

This seems a detour when you already have a fairly large (but not huge) stars object (such as landsat_stars from above) in memory. How can I downsample it without having a function such st_downsample?

mtennekes avatar Sep 14 '21 15:09 mtennekes

Just to confirm that a SpatRaster always has three dimensions (nrow, ncol, nlyr). There is also the SpatRasterDataSet that can have 1 additional dimension ("group", "variable"), in the future I may allow for more dimensions (nested variables). The SpatRasterDataSet is mostly for reading data from files that have groups, and for combining SpatRasters for computations where each multi-layer SpatRaster is a variable. In the context of tmap I would not worry about these at this point in time.

For downsampling, terra uses spatSample (see Downsample function? rspatial/terra#330). Occasionally, that can of course lead to showing undesired outliers. But it is much faster than aggregate for large datasets. I suppose that there could be an option to use aggregate instead, but it might be better to leave aggregation to the user as a pre-processing step, so that they can also decide whether to use mean, sum, median, na.rm=T, etc. Mapping count data can become much more meaningful when you can actually see all the raster cells, but whether someone wants to show the summed or the sampled data will depend on the context.

rhijmans avatar Sep 15 '21 07:09 rhijmans

@rhijmans it seems that e.g.

spatSample(landsat_terra, 1e4, as.raster=TRUE)

doesn't return a very useful raster; when plotted: x

edzer avatar Sep 15 '21 10:09 edzer

@edzer it should work better with regular sampling:

x <- spatSample(landsat_terra, 1e4, method="regular", as.raster=TRUE)

rhijmans avatar Sep 15 '21 11:09 rhijmans

Thanks!

edzer avatar Sep 15 '21 11:09 edzer