What does `st_extract` of the stars R package do when you have an extra dimension?
I have read in to R as a stars object some gridded temperature data (from Copernicus) that has the following form, including 4 depth layers:
> thetao
stars object with 4 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
thetao [°C] 14.89434 22.29057 24.02499 23.2406 24.85925 26.16007 88129
dimension(s):
from to offset delta refsys values x/y
x 1 973 19.46 0.08333 WGS 84 NULL [x]
y 1 721 30.04 -0.08333 WGS 84 NULL [y]
depth 1 4 NA NA udunits [0.494025,1.541375) [m],...,[3.819495,4.993321) [m]
time 1 12 NA NA POSIXct 1993-01-01,...,1993-12-01
I also have a space-time point dataset for which I want to extract the mean temperature values over all depth layers. This layer looks as follows:
> dd
Simple feature collection with 5 features and 1 field
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 48 ymin: -5 xmax: 55.7 ymax: 2.7
Geodetic CRS: WGS 84
yrmon the_geom
1 1993-05-01 POINT (55.7 0.85)
2 1993-05-01 POINT (55.55 -0.35)
3 1993-08-01 POINT (50.75 2.7)
4 1993-08-01 POINT (49.2 -1)
5 1993-08-01 POINT (48 -5)
To extract the temperature data, I tried using st_extract with the 4 depth layers all at once thinking it would magically just average over depth layers, but when I compared this to slicing out individual depth layers, extracting values, and then averaging the 4 values, I got a very different result. Can someone explain what st_extract is doing in the two cases, and which is correct?
The specific code I ran to compare the two was:
x = st_extract(thetao,dd,time_column="yrmon")
for (i in 1:4) {
t = slice(thetao,"depth",i)
xx = st_extract(t,dd,time_column="yrmon")
x[paste0("thetao",i)]=xx$thetao
}
x$thetaomn = (x$thetao1 + x$thetao2 + x$thetao3 + x$thetao4)/4
The result of this was:
> x
Simple feature collection with 5 features and 8 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 48 ymin: -5 xmax: 55.7 ymax: 2.7
Geodetic CRS: WGS 84
thetao time yrmon the_geom thetao1 thetao2 thetao3 thetao4 thetaomn
1 27.32905 [°C] 1993-05-01 1993-05-01 POINT (55.7 0.85) 29.85672 [°C] 29.77908 [°C] 29.72781 [°C] 29.70510 [°C] 29.76717 [°C]
2 27.31513 [°C] 1993-05-01 1993-05-01 POINT (55.55 -0.35) 29.79519 [°C] 29.72854 [°C] 29.68166 [°C] 29.65896 [°C] 29.71609 [°C]
3 27.42647 [°C] 1993-08-01 1993-08-01 POINT (50.75 2.7) 25.90078 [°C] 25.89859 [°C] 25.89492 [°C] 25.89126 [°C] 25.89639 [°C]
4 27.42940 [°C] 1993-08-01 1993-08-01 POINT (49.2 -1) 25.84365 [°C] 25.83853 [°C] 25.82974 [°C] 25.82168 [°C] 25.83340 [°C]
5 28.39988 [°C] 1993-08-01 1993-08-01 POINT (48 -5) 25.00061 [°C] 24.99841 [°C] 24.99622 [°C] 24.99255 [°C] 24.99695 [°C]
As the values are very different, clearly one of these is very wrong, but which?
Note that I am aware that I can just average the raster itself over depth and then apply st_extract, but I still want to understand what st_extract is doing when there are extra dimensions as I can imagine it might permit me to do more complex operations that aren't as simple as just averaging the 4 depth layers. If not, I would hope there would be a warning explaining that there are extra dimensions that are being handled in some way.
The specific bash code used to download the copernicus data was:
minlon=19.5
maxlon=100.5
minlat=-30
maxlat=30
copernicusmarine subset -i "cmems_mod_glo_phy_my_0.083deg_P1M-m" \
-x ${minlon} -X ${maxlon} -y ${minlat} -Y ${maxlat} \
-t 1993-01-01 -T 1993-12-31 \
-z 0 -Z 5 \
--variable mlotst --variable thetao \
-o ./ \
-f indian.1993.nc
And some code to create at least the start of the point data is:
library(tidyverse)
library(sf)
dd = data.frame(
yrmon = as.Date(c("1993-05-01","1993-05-01","1993-08-01","1993-08-01","1993-08-01")),
lon = c(55.7,55.55,50.75,49.2,48),
lat = c(0.85,-0.35,2.7,-1,-5)
) |>
st_as_sf(coords=c("lon","lat"),sf_column_name="the_geom")
Thanks for the clear problem description!
st_extract(thetao[,,,1], dd, time_column = "yrmon")
gives you the correct values for the first depth, but doing this for all depths (your example) seems completely off.
This fix brings us this:
library(tidyverse)
# ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
# ✔ dplyr 1.1.4 ✔ readr 2.1.5
# ✔ forcats 1.0.0 ✔ stringr 1.5.1
# ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
# ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
# ✔ purrr 1.1.0
# ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag() masks stats::lag()
# ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
# Linking to GEOS 3.12.2, GDAL 3.11.3, PROJ 9.4.1; sf_use_s2() is TRUE
library(stars)
# Loading required package: abind
(thetao = read_mdim("indian.1993.nc"))
# stars object with 4 dimensions and 1 attribute
# attribute(s), summary of first 1e+05 cells:
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# thetao [°C] 21.13916 24.27915 25.64296 25.49449 26.74236 29.99881 19669
# dimension(s):
# from to offset delta refsys
# longitude 1 973 19.46 0.08333 WGS 84 (CRS84)
# latitude 1 721 -30.04 0.08333 WGS 84 (CRS84)
# depth 1 4 NA NA udunits
# time 1 12 NA NA Date
# values x/y
# longitude NULL [x]
# latitude NULL [y]
# depth [0.494025,1.541375) [m],...,[3.819495,4.993321) [m]
# time 1993-01-01,...,1993-12-01
dd = data.frame(
yrmon = as.Date(c("1993-05-01","1993-05-01","1993-08-01","1993-08-01","1993-08-01")),
lon = c(55.7,55.55,50.75,49.2,48),
lat = c(0.85,-0.35,2.7,-1,-5)
) |>
st_as_sf(coords=c("lon","lat"),sf_column_name="the_geom", crs = st_crs(thetao))
dd
# Simple feature collection with 5 features and 1 field
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: 48 ymin: -5 xmax: 55.7 ymax: 2.7
# Geodetic CRS: WGS 84 (CRS84)
# yrmon the_geom
# 1 1993-05-01 POINT (55.7 0.85)
# 2 1993-05-01 POINT (55.55 -0.35)
# 3 1993-08-01 POINT (50.75 2.7)
# 4 1993-08-01 POINT (49.2 -1)
# 5 1993-08-01 POINT (48 -5)
x = st_extract(thetao, dd)
all.equal( as.vector(units::drop_units(x[,,1,1][[1]])), units::drop_units(st_extract(thetao[,,,1,1], dd)[[1]]))
# [1] TRUE
all.equal( as.vector(units::drop_units(x[,,1,3][[1]])), units::drop_units(st_extract(thetao[,,,1,3], dd)[[1]]))
# [1] TRUE
all.equal( as.vector(units::drop_units(x[,,3,8][[1]])), units::drop_units(st_extract(thetao[,,,3,8], dd)[[1]]))
# [1] TRUE
match(dd$yrmon, st_get_dimension_values(thetao, "time"))
# [1] 5 5 8 8 8
(x2 = st_extract(thetao[,,,1], dd, time_column = "yrmon"))
# Simple feature collection with 5 features and 3 fields
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: 48 ymin: -5 xmax: 55.7 ymax: 2.7
# Geodetic CRS: WGS 84 (CRS84)
# thetao time yrmon the_geom
# 1 29.85672 [°C] 1993-05-01 1993-05-01 POINT (55.7 0.85)
# 2 29.79519 [°C] 1993-05-01 1993-05-01 POINT (55.55 -0.35)
# 3 25.90078 [°C] 1993-08-01 1993-08-01 POINT (50.75 2.7)
# 4 25.84365 [°C] 1993-08-01 1993-08-01 POINT (49.2 -1)
# 5 25.00061 [°C] 1993-08-01 1993-08-01 POINT (48 -5)
all.equal(x[[1]][cbind(1:5,1,c(5,5,8,8,8))], x2$thetao) # TRUE
# [1] TRUE
x3 = st_extract(thetao, dd, time_column = "yrmon")
print(x3, n = 20)
# Simple feature collection with 20 features and 3 fields
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: 48 ymin: -5 xmax: 55.7 ymax: 2.7
# Geodetic CRS: WGS 84 (CRS84)
# thetao yrmon depth the_geom
# 1 29.85672 [°C] 1993-05-01 0.494025 [m] POINT (55.7 0.85)
# 2 29.77908 [°C] 1993-05-01 1.541375 [m] POINT (55.7 0.85)
# 3 29.72781 [°C] 1993-05-01 2.645669 [m] POINT (55.7 0.85)
# 4 29.70510 [°C] 1993-05-01 3.819495 [m] POINT (55.7 0.85)
# 5 29.79519 [°C] 1993-05-01 0.494025 [m] POINT (55.55 -0.35)
# 6 29.72854 [°C] 1993-05-01 1.541375 [m] POINT (55.55 -0.35)
# 7 29.68166 [°C] 1993-05-01 2.645669 [m] POINT (55.55 -0.35)
# 8 29.65896 [°C] 1993-05-01 3.819495 [m] POINT (55.55 -0.35)
# 9 25.90078 [°C] 1993-08-01 0.494025 [m] POINT (50.75 2.7)
# 10 25.89859 [°C] 1993-08-01 1.541375 [m] POINT (50.75 2.7)
# 11 25.89492 [°C] 1993-08-01 2.645669 [m] POINT (50.75 2.7)
# 12 25.89126 [°C] 1993-08-01 3.819495 [m] POINT (50.75 2.7)
# 13 25.84365 [°C] 1993-08-01 0.494025 [m] POINT (49.2 -1)
# 14 25.83853 [°C] 1993-08-01 1.541375 [m] POINT (49.2 -1)
# 15 25.82974 [°C] 1993-08-01 2.645669 [m] POINT (49.2 -1)
# 16 25.82168 [°C] 1993-08-01 3.819495 [m] POINT (49.2 -1)
# 17 25.00061 [°C] 1993-08-01 0.494025 [m] POINT (48 -5)
# 18 24.99841 [°C] 1993-08-01 1.541375 [m] POINT (48 -5)
# 19 24.99622 [°C] 1993-08-01 2.645669 [m] POINT (48 -5)
# 20 24.99255 [°C] 1993-08-01 3.819495 [m] POINT (48 -5)
which I believe to be correct. The code is still somewhat shaky, allowing for just one extra dimension I think. We can't return a stars object because time and space share a single dimension: this is a 4 (depth) x 5 (space + time) "cube", so it's in long table form. Let me know if this works for you.