stars icon indicating copy to clipboard operation
stars copied to clipboard

What does `st_extract` of the stars R package do when you have an extra dimension?

Open dmkaplan2000 opened this issue 6 months ago • 3 comments

I have read in to R as a stars object some gridded temperature data (from Copernicus) that has the following form, including 4 depth layers:

> thetao
stars object with 4 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
                Min.  1st Qu.   Median    Mean  3rd Qu.     Max.  NA's
thetao [°C] 14.89434 22.29057 24.02499 23.2406 24.85925 26.16007 88129
dimension(s):
      from  to offset    delta  refsys                                              values x/y
x        1 973  19.46  0.08333  WGS 84                                                NULL [x]
y        1 721  30.04 -0.08333  WGS 84                                                NULL [y]
depth    1   4     NA       NA udunits [0.494025,1.541375) [m],...,[3.819495,4.993321) [m]    
time     1  12     NA       NA POSIXct                           1993-01-01,...,1993-12-01    

I also have a space-time point dataset for which I want to extract the mean temperature values over all depth layers. This layer looks as follows:

> dd
Simple feature collection with 5 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 48 ymin: -5 xmax: 55.7 ymax: 2.7
Geodetic CRS:  WGS 84
       yrmon            the_geom
1 1993-05-01   POINT (55.7 0.85)
2 1993-05-01 POINT (55.55 -0.35)
3 1993-08-01   POINT (50.75 2.7)
4 1993-08-01     POINT (49.2 -1)
5 1993-08-01       POINT (48 -5)

To extract the temperature data, I tried using st_extract with the 4 depth layers all at once thinking it would magically just average over depth layers, but when I compared this to slicing out individual depth layers, extracting values, and then averaging the 4 values, I got a very different result. Can someone explain what st_extract is doing in the two cases, and which is correct?

The specific code I ran to compare the two was:

x = st_extract(thetao,dd,time_column="yrmon")

for (i in 1:4) {
  t = slice(thetao,"depth",i)
  xx = st_extract(t,dd,time_column="yrmon")
  x[paste0("thetao",i)]=xx$thetao
}

x$thetaomn = (x$thetao1 + x$thetao2 + x$thetao3 + x$thetao4)/4

The result of this was:

> x
Simple feature collection with 5 features and 8 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 48 ymin: -5 xmax: 55.7 ymax: 2.7
Geodetic CRS:  WGS 84
         thetao       time      yrmon            the_geom       thetao1       thetao2       thetao3       thetao4      thetaomn
1 27.32905 [°C] 1993-05-01 1993-05-01   POINT (55.7 0.85) 29.85672 [°C] 29.77908 [°C] 29.72781 [°C] 29.70510 [°C] 29.76717 [°C]
2 27.31513 [°C] 1993-05-01 1993-05-01 POINT (55.55 -0.35) 29.79519 [°C] 29.72854 [°C] 29.68166 [°C] 29.65896 [°C] 29.71609 [°C]
3 27.42647 [°C] 1993-08-01 1993-08-01   POINT (50.75 2.7) 25.90078 [°C] 25.89859 [°C] 25.89492 [°C] 25.89126 [°C] 25.89639 [°C]
4 27.42940 [°C] 1993-08-01 1993-08-01     POINT (49.2 -1) 25.84365 [°C] 25.83853 [°C] 25.82974 [°C] 25.82168 [°C] 25.83340 [°C]
5 28.39988 [°C] 1993-08-01 1993-08-01       POINT (48 -5) 25.00061 [°C] 24.99841 [°C] 24.99622 [°C] 24.99255 [°C] 24.99695 [°C]

As the values are very different, clearly one of these is very wrong, but which?

dmkaplan2000 avatar Jun 20 '25 08:06 dmkaplan2000

Note that I am aware that I can just average the raster itself over depth and then apply st_extract, but I still want to understand what st_extract is doing when there are extra dimensions as I can imagine it might permit me to do more complex operations that aren't as simple as just averaging the 4 depth layers. If not, I would hope there would be a warning explaining that there are extra dimensions that are being handled in some way.

The specific bash code used to download the copernicus data was:

minlon=19.5
maxlon=100.5
minlat=-30
maxlat=30

copernicusmarine subset -i "cmems_mod_glo_phy_my_0.083deg_P1M-m" \
  -x ${minlon} -X ${maxlon} -y ${minlat} -Y ${maxlat} \
  -t 1993-01-01 -T 1993-12-31  \
  -z 0 -Z 5 \
  --variable mlotst --variable thetao \
  -o ./ \
  -f indian.1993.nc

And some code to create at least the start of the point data is:

library(tidyverse)
library(sf)
dd = data.frame(
  yrmon = as.Date(c("1993-05-01","1993-05-01","1993-08-01","1993-08-01","1993-08-01")),
  lon = c(55.7,55.55,50.75,49.2,48),
  lat = c(0.85,-0.35,2.7,-1,-5)
) |>
  st_as_sf(coords=c("lon","lat"),sf_column_name="the_geom")

dmkaplan2000 avatar Jun 20 '25 08:06 dmkaplan2000

Thanks for the clear problem description!

st_extract(thetao[,,,1], dd, time_column = "yrmon")

gives you the correct values for the first depth, but doing this for all depths (your example) seems completely off.

edzer avatar Sep 30 '25 16:09 edzer

This fix brings us this:

library(tidyverse)
# ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
# ✔ dplyr     1.1.4     ✔ readr     2.1.5
# ✔ forcats   1.0.0     ✔ stringr   1.5.1
# ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
# ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
# ✔ purrr     1.1.0     
# ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag()    masks stats::lag()
# ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
# Linking to GEOS 3.12.2, GDAL 3.11.3, PROJ 9.4.1; sf_use_s2() is TRUE
library(stars)
# Loading required package: abind
(thetao = read_mdim("indian.1993.nc"))
# stars object with 4 dimensions and 1 attribute
# attribute(s), summary of first 1e+05 cells:
#                 Min.  1st Qu.   Median     Mean  3rd Qu.     Max.  NA's
# thetao [°C] 21.13916 24.27915 25.64296 25.49449 26.74236 29.99881 19669
# dimension(s):
#           from  to offset   delta         refsys
# longitude    1 973  19.46 0.08333 WGS 84 (CRS84)
# latitude     1 721 -30.04 0.08333 WGS 84 (CRS84)
# depth        1   4     NA      NA        udunits
# time         1  12     NA      NA           Date
#                                                        values x/y
# longitude                                                NULL [x]
# latitude                                                 NULL [y]
# depth     [0.494025,1.541375) [m],...,[3.819495,4.993321) [m]    
# time                                1993-01-01,...,1993-12-01    
dd = data.frame(
  yrmon = as.Date(c("1993-05-01","1993-05-01","1993-08-01","1993-08-01","1993-08-01")),
  lon = c(55.7,55.55,50.75,49.2,48),
  lat = c(0.85,-0.35,2.7,-1,-5)
) |>
  st_as_sf(coords=c("lon","lat"),sf_column_name="the_geom", crs = st_crs(thetao))
dd
# Simple feature collection with 5 features and 1 field
# Geometry type: POINT
# Dimension:     XY
# Bounding box:  xmin: 48 ymin: -5 xmax: 55.7 ymax: 2.7
# Geodetic CRS:  WGS 84 (CRS84)
#        yrmon            the_geom
# 1 1993-05-01   POINT (55.7 0.85)
# 2 1993-05-01 POINT (55.55 -0.35)
# 3 1993-08-01   POINT (50.75 2.7)
# 4 1993-08-01     POINT (49.2 -1)
# 5 1993-08-01       POINT (48 -5)
x = st_extract(thetao, dd)
all.equal( as.vector(units::drop_units(x[,,1,1][[1]])), units::drop_units(st_extract(thetao[,,,1,1], dd)[[1]]))
# [1] TRUE
all.equal( as.vector(units::drop_units(x[,,1,3][[1]])), units::drop_units(st_extract(thetao[,,,1,3], dd)[[1]]))
# [1] TRUE
all.equal( as.vector(units::drop_units(x[,,3,8][[1]])), units::drop_units(st_extract(thetao[,,,3,8], dd)[[1]]))
# [1] TRUE


match(dd$yrmon, st_get_dimension_values(thetao, "time"))
# [1] 5 5 8 8 8
(x2 = st_extract(thetao[,,,1], dd, time_column = "yrmon"))
# Simple feature collection with 5 features and 3 fields
# Geometry type: POINT
# Dimension:     XY
# Bounding box:  xmin: 48 ymin: -5 xmax: 55.7 ymax: 2.7
# Geodetic CRS:  WGS 84 (CRS84)
#          thetao       time      yrmon            the_geom
# 1 29.85672 [°C] 1993-05-01 1993-05-01   POINT (55.7 0.85)
# 2 29.79519 [°C] 1993-05-01 1993-05-01 POINT (55.55 -0.35)
# 3 25.90078 [°C] 1993-08-01 1993-08-01   POINT (50.75 2.7)
# 4 25.84365 [°C] 1993-08-01 1993-08-01     POINT (49.2 -1)
# 5 25.00061 [°C] 1993-08-01 1993-08-01       POINT (48 -5)
all.equal(x[[1]][cbind(1:5,1,c(5,5,8,8,8))], x2$thetao) # TRUE
# [1] TRUE

x3 = st_extract(thetao, dd, time_column = "yrmon")
print(x3, n = 20)
# Simple feature collection with 20 features and 3 fields
# Geometry type: POINT
# Dimension:     XY
# Bounding box:  xmin: 48 ymin: -5 xmax: 55.7 ymax: 2.7
# Geodetic CRS:  WGS 84 (CRS84)
#           thetao      yrmon        depth            the_geom
# 1  29.85672 [°C] 1993-05-01 0.494025 [m]   POINT (55.7 0.85)
# 2  29.77908 [°C] 1993-05-01 1.541375 [m]   POINT (55.7 0.85)
# 3  29.72781 [°C] 1993-05-01 2.645669 [m]   POINT (55.7 0.85)
# 4  29.70510 [°C] 1993-05-01 3.819495 [m]   POINT (55.7 0.85)
# 5  29.79519 [°C] 1993-05-01 0.494025 [m] POINT (55.55 -0.35)
# 6  29.72854 [°C] 1993-05-01 1.541375 [m] POINT (55.55 -0.35)
# 7  29.68166 [°C] 1993-05-01 2.645669 [m] POINT (55.55 -0.35)
# 8  29.65896 [°C] 1993-05-01 3.819495 [m] POINT (55.55 -0.35)
# 9  25.90078 [°C] 1993-08-01 0.494025 [m]   POINT (50.75 2.7)
# 10 25.89859 [°C] 1993-08-01 1.541375 [m]   POINT (50.75 2.7)
# 11 25.89492 [°C] 1993-08-01 2.645669 [m]   POINT (50.75 2.7)
# 12 25.89126 [°C] 1993-08-01 3.819495 [m]   POINT (50.75 2.7)
# 13 25.84365 [°C] 1993-08-01 0.494025 [m]     POINT (49.2 -1)
# 14 25.83853 [°C] 1993-08-01 1.541375 [m]     POINT (49.2 -1)
# 15 25.82974 [°C] 1993-08-01 2.645669 [m]     POINT (49.2 -1)
# 16 25.82168 [°C] 1993-08-01 3.819495 [m]     POINT (49.2 -1)
# 17 25.00061 [°C] 1993-08-01 0.494025 [m]       POINT (48 -5)
# 18 24.99841 [°C] 1993-08-01 1.541375 [m]       POINT (48 -5)
# 19 24.99622 [°C] 1993-08-01 2.645669 [m]       POINT (48 -5)
# 20 24.99255 [°C] 1993-08-01 3.819495 [m]       POINT (48 -5)

which I believe to be correct. The code is still somewhat shaky, allowing for just one extra dimension I think. We can't return a stars object because time and space share a single dimension: this is a 4 (depth) x 5 (space + time) "cube", so it's in long table form. Let me know if this works for you.

edzer avatar Oct 01 '25 12:10 edzer