cosima-cookbook
Identify coordinates in database
The explorer uses a crude heuristic to identify coordinates so they can be hidden from view when selecting variables to load.
It would be better to identify coordinates from the metadata available when scanning data files, and to save this as a boolean in the ncvars table.
xarray.Dataset identifies coordinates in the .coords attribute of a dataset, which is the minimum for identifying coordinate variables.
There are other variables that xarray does not classify as coordinates, but which should be. For example, in the MOM outputs average_T1, average_T2, average_DT and time_bounds contain ancillary coordinate data and should be flagged as coordinates.
Bounds variables can be identified because they appear in the bounds attribute of another variable. This is done in splitvar:
https://github.com/coecms/splitvar/blob/master/splitvar/utils.py#L75-L90
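A minimal sketch of the bounds check described above. It is not the splitvar code. It assumes the scanner has already collected each variable's attributes into a plain dict, and the function name and the example attributes are hypothetical:

```python
def find_bounds_variables(variables):
    """Return names of variables referenced by another variable's ``bounds`` attribute.

    ``variables`` maps variable name -> dict of attributes. This layout is
    an assumption for the sketch, not the cookbook's internal structure.
    """
    bounds = set()
    for attrs in variables.values():
        target = attrs.get("bounds")
        # Only flag names that actually exist as variables in the file
        if target in variables:
            bounds.add(target)
    return bounds


# Illustrative attributes only, loosely modelled on a MOM output file
example = {
    "time": {"units": "days since 1900-01-01", "bounds": "time_bounds"},
    "time_bounds": {"units": "days"},
    "salt": {"units": "psu"},
}

print(find_bounds_variables(example))  # {'time_bounds'}
```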
The other variables listed above are also named in an attribute of another variable (here time_avg_info), e.g.:
float salt(time, st_ocean, yt_ocean, xt_ocean) ;
salt:long_name = "Practical Salinity" ;
salt:units = "psu" ;
salt:valid_range = -10.f, 100.f ;
salt:missing_value = -1.e+20f ;
salt:_FillValue = -1.e+20f ;
salt:cell_methods = "time: mean" ;
salt:time_avg_info = "average_T1,average_T2,average_DT" ;
salt:coordinates = "geolon_t geolat_t" ;
salt:standard_name = "sea_water_salinity" ;
They can be identified by adapting the code from splitvar that tries to find all variables that another variable "depends on":
https://github.com/coecms/splitvar/blob/master/splitvar/splitvar.py#L226-L248
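As a rough illustration of the "depends on" idea (again, not the splitvar implementation), the sketch below scans a chosen set of attributes for names of other variables in the same file. The set of attributes to scan and the dict layout are assumptions:

```python
import re

# Attributes whose values name other variables. Which attributes belong
# here is an assumption for this sketch, not the list splitvar uses.
REFERENCE_ATTRS = ("bounds", "coordinates", "time_avg_info")


def referenced_variables(variables):
    """Return names of variables that some other variable refers to.

    ``variables`` maps variable name -> dict of attributes (hypothetical layout).
    """
    flagged = set()
    for attrs in variables.values():
        for key in REFERENCE_ATTRS:
            value = attrs.get(key, "")
            # coordinates is space separated, time_avg_info is comma separated
            for name in re.split(r"[,\s]+", value.strip()):
                if name in variables:
                    flagged.add(name)
    return flagged


# Attributes taken from the salt example above; the other entries are placeholders
mom = {
    "salt": {
        "time_avg_info": "average_T1,average_T2,average_DT",
        "coordinates": "geolon_t geolat_t",
    },
    "average_T1": {}, "average_T2": {}, "average_DT": {},
    "geolon_t": {}, "geolat_t": {},
    "time": {"bounds": "time_bounds"}, "time_bounds": {},
}

print(sorted(referenced_variables(mom)))
```

Only names that exist as variables in the file are flagged, so free text in an attribute value is ignored unless it happens to match a variable name.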
For the ice data, variables like TLON should be flagged as coordinates. The logic above would also work for these, as they are listed in the coordinates attribute of other variables:
float hi(time, nj, ni) ;
hi:units = "m" ;
hi:long_name = "grid cell mean ice thickness" ;
hi:coordinates = "TLON TLAT time" ;
hi:cell_measures = "area: tarea" ;
hi:missing_value = 1.e+30f ;
hi:_FillValue = 1.e+30f ;
hi:cell_methods = "time: mean" ;
hi:time_rep = "averaged" ;
It might be that the coordinates attribute should be a special case that is specifically searched for.
If the proposal in #191 were taken up, it would overlap significantly with this.
There is a lot of redundancy in recording the same dimensions/chunking in every NCFile:
$ sqlite3 /g/data/ik11/databases/cosima_master.db
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite> select count(*) from (select dimensions, chunking from ncvars) t;
8872212
sqlite> select count(*) from (select distinct dimensions, chunking from ncvars) t;
307
sqlite>
I think this plays into #191: coordinates don't (generally; WRF is an exception) change with time. So it makes sense to store them in separate tables, give each a unique "grid" id and just associate the grid id with a variable.
This isn't so much schema breaking as schema exploding.
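To make the grid id idea concrete, here is a small sketch using an in-memory SQLite database. The ncvars table is a simplified stand-in for the real one, the grids table and grid_id column are hypothetical, and the chunking values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in for the existing table
    CREATE TABLE ncvars (
        id INTEGER PRIMARY KEY,
        name TEXT,
        dimensions TEXT,
        chunking TEXT
    );
    -- Hypothetical table holding each distinct combination once
    CREATE TABLE grids (
        id INTEGER PRIMARY KEY,
        dimensions TEXT,
        chunking TEXT,
        UNIQUE (dimensions, chunking)
    );
""")

# Illustrative rows: two variables share a grid, one does not
rows = [
    ("salt", "(time, st_ocean, yt_ocean, xt_ocean)", "(1, 7, 300, 400)"),
    ("temp", "(time, st_ocean, yt_ocean, xt_ocean)", "(1, 7, 300, 400)"),
    ("hi", "(time, nj, ni)", "(1, 300, 360)"),
]
conn.executemany(
    "INSERT INTO ncvars (name, dimensions, chunking) VALUES (?, ?, ?)", rows
)

# Store each distinct (dimensions, chunking) pair once
conn.execute(
    "INSERT INTO grids (dimensions, chunking) "
    "SELECT DISTINCT dimensions, chunking FROM ncvars"
)

# Point every variable at its grid
conn.execute("ALTER TABLE ncvars ADD COLUMN grid_id INTEGER")
conn.execute("""
    UPDATE ncvars SET grid_id = (
        SELECT g.id FROM grids g
        WHERE g.dimensions = ncvars.dimensions
          AND g.chunking = ncvars.chunking
    )
""")

n_vars = conn.execute("SELECT count(*) FROM ncvars").fetchone()[0]
n_grids = conn.execute("SELECT count(*) FROM grids").fetchone()[0]
print(n_vars, n_grids)  # 3 2
```

The equality join would not match rows where dimensions or chunking is NULL, so a real migration would need to decide how to treat those.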
Note: we're not currently storing the actual size of dimensions, so a separate dimensions table makes sense. cf-xarray could be used to add X/lon, Y/lat, Z/depth categorisation. Grids could then be classified as 2D/3D based on distinct dimension id tuples.
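A toy sketch of classifying grids as 2D or 3D from their dimension tuples. The axis table is filled in by hand here and is purely hypothetical. In practice that categorisation would come from cf-xarray or from file metadata:

```python
# Hypothetical dimension -> axis table; stands in for whatever
# cf-xarray or the file metadata would provide.
AXIS = {
    "xt_ocean": "X", "yt_ocean": "Y", "st_ocean": "Z",
    "ni": "X", "nj": "Y",
    "time": "T",
}


def spatial_rank(dims):
    """Count the distinct spatial axes (X, Y, Z) among a tuple of dimension names."""
    spatial = {AXIS.get(d) for d in dims} & {"X", "Y", "Z"}
    return len(spatial)


# Distinct dimension tuples, as they might be read from a dimensions table
grids = {
    ("time", "st_ocean", "yt_ocean", "xt_ocean"),
    ("time", "nj", "ni"),
}

for dims in sorted(grids):
    print(dims, f"{spatial_rank(dims)}D")
```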