Example datasets for different (ocean) circulation models
This is a continuation of the work in #1972
We currently have `_datasets.[un]structured.generic` providing generic example datasets for Parcels. These examples, however, are disconnected from the different ways that (ocean) circulation models output their data, since the teams responsible for each circulation model make their own formatting decisions.
To support these various circulation models within Parcels, it's important that we (a) identify the circulation models that we want to support, and (b) provide dummy example datasets closely matching the format of the originals that we can use for testing.
These datasets will be defined in `_datasets.[un]structured.circulation_models` (as outlined in #1972) and they should be defined similarly to the datasets in `_datasets.[un]structured.generic`, i.e.:
- resolution defined by constants (e.g., in the structured case, `T, Z, Y, X` at the top of the file controlling the time, z, y, and x resolution)
- field data is just random data
- time domain spans the beginning of 2000 to the beginning of 2001
- private functions which just return the datasets
- Note: when defining these dummy datasets it's important to also record key information about the original dataset (e.g., which version of the model the data was generated with)
- expose a public `datasets` dictionary (the only difference here is that values might be 2-tuples of xarray datasets in the case where the field and grid data are stored in separate netCDF files); a minimal sketch of this layout follows below
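For concreteness, here is a minimal sketch of what one such module could look like, following the `generic` conventions above (all names, dimensions, coordinate values, and attributes here are hypothetical placeholders, not an actual circulation-model layout):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical resolution constants, mirroring the style of the generic module
T, Z, Y, X = 3, 4, 5, 6


def _mitgcm_rectilinear():
    """Dummy dataset loosely mimicking MITgcm-style output (all values are placeholders)."""
    # Time domain: beginning of 2000 to beginning of 2001
    time = pd.date_range("2000-01-01", "2001-01-01", periods=T)
    return xr.Dataset(
        data_vars={
            "U": (("time", "z", "y", "x"), np.random.rand(T, Z, Y, X)),
            "V": (("time", "z", "y", "x"), np.random.rand(T, Z, Y, X)),
        },
        coords={
            "time": time,
            "z": np.linspace(0.0, -100.0, Z),
            "y": np.linspace(0.0, 1e5, Y),
            "x": np.linspace(0.0, 1e5, X),
        },
        # Record provenance of the original dataset the dummy is modelled on
        attrs={"source_model": "MITgcm", "model_version": "placeholder"},
    )


# Public mapping; values could also be 2-tuples of (field_ds, grid_ds) where the
# original model stores field and grid information in separate netCDF files.
datasets = {
    "mitgcm_rectilinear": _mitgcm_rectilinear(),
}
```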
See the child issues below for structured/unstructured-specific discussion. In these issues we discuss (a) which models we want to support in Parcels, and (b) items to pay attention to when looking at these datasets (i.e., where circulation models differ from each other).
A summary table in the `circulation_models.py` docstring of how the circulation models' datasets differ would be appreciated.
This is an MITgcm example dataset for a Cartesian rectilinear regional grid (baroclinic double gyre) that has 3 time levels in the netcdf file. The time is not managed with a calendar, and is instead referenced via the iteration count.
It'd probably be worth it to have a few other examples:
- Curvilinear Cartesian grid
- Rectilinear spherical grid
- Curvilinear spherical grid
- Examples where the time domain is managed with the `calendar` package in MITgcm
@VeckoTheGecko - I would vote for using analytical functions rather than random data in the fields. This is a bit more useful for validating interpolation methods. For example, if we use separable multilinear fields, e.g. `f = x*y*z`, and we are providing 1st or 2nd order interpolation methods, the interpolation should be exact, which provides useful assertions when writing tests.
Nonlinear analytical functions are nice too; these would provide a means to run convergence tests to verify the interpolation schemes converge at the correct rate. This provides a much more robust indication that the methods we're providing out of the box are coded up correctly.
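As a standalone illustration of the first point (using scipy's `RegularGridInterpolator` purely as a stand-in for whatever interpolation method is under test, not Parcels code): trilinear interpolation reproduces a multilinear field such as `f = x*y*z` exactly, even at off-grid points.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Rectilinear grid and a separable, multilinear field f = x*y*z
x = np.linspace(0.0, 1.0, 5)
y = np.linspace(0.0, 2.0, 6)
z = np.linspace(0.0, 3.0, 7)
X, Y, Z = np.meshgrid(x, y, z, indexing="ij")
f = X * Y * Z

interp = RegularGridInterpolator((x, y, z), f, method="linear")

# Trilinear interpolation is exact for multilinear fields, so the assertion
# holds (to floating-point precision) at arbitrary off-grid sample points.
pts = np.random.default_rng(0).uniform((0, 0, 0), (1, 2, 3), size=(100, 3))
np.testing.assert_allclose(interp(pts), pts[:, 0] * pts[:, 1] * pts[:, 2])
```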
> analytical functions... This is a bit more useful for validating interpolation methods.
Agreed that we'd need analytical functions for validating interpolation methods.
> I would vote for using analytical functions rather than random data in the fields
How straightforward would this be? (I imagine the grid information would need to be accounted for here? That would make for quite involved dataset generation, no?)
Rather than modifying the raw generic datasets which are shared across the project, would we be able to either (a) provide specific examples in a `datasets.structured.analytical`, or (b) do this by patching the data?
Patching:
```python
from parcels._datasets.structured.generic import datasets as datasets_structured

ds = datasets_structured["some_dataset"]  # dataset has random data
ds["U"] = patch_with_analytical_field(ds["U"], equation="f = x*y*z")  # same structure/attrs/etc., but now with analytical data
```
I think that this:
- is more extensible (i.e., can expand to non-linear functions)
- leaves the original datasets unchanged
- I imagine in a lot of tests we don't care about the actual data. For these test cases, the `generic` module as-is makes it quite easy to see the structure of the different datasets, and including lots of code for analytical field generation would muddy things here, I think.
(b) is probably much more work, so it might be easier to go with (a).
What would be a use case for random data? Patching would still be good to allow for the development of more sophisticated tests without reproducing a ton of boilerplate for each kind of analytical function we would want to target.
In the short term, I would say constant data would be more useful than random (if we can't interpolate constant data correctly, all bets are off).
> What would be a use case for random data?
Oh yes, no particular use case for random data - I just wanted something in there, and the xgcm datasets had random data in them.
> In the short term, I would say constant data would be more useful than random (if we can't interpolate constant data correctly, all bets are off).
Agreed. What do you have in mind?
- replace `np.random` with `np.ones`?
- something more spatially varying so you can do interesting interpolation?
- if the latter, perhaps create a new dataset as opposed to modifying an existing one? That way the existing datasets can stay a bit simpler.
In response to https://github.com/OceanParcels/Parcels/issues/2004#issuecomment-2912516985
> - Curvilinear Cartesian grid
> - Rectilinear spherical grid
> - Curvilinear spherical grid
Yes, I think this would be good to put in `datasets.structured.generic` since they aren't specific to a certain model.
> - Examples where the time domain is managed with the `calendar` package in MITgcm
Do you know if/how this differs from `cftime`?
> This is an MITgcm example dataset for a Cartesian rectilinear regional grid (baroclinic double gyre) that has 3 time levels in the netcdf file. The time is not managed with a calendar, and is instead referenced via the iteration count.
I was thinking, perhaps it would be nice to mimic publicly available datasets when we're making these test datasets. For this specific dataset - do you think it would differ significantly from, say, this dataset[^1] (i.e., differ in dimension ordering, metadata, organisation of the files, or something else I haven't considered)? It would of course differ in grid/field values - I'm not concerned much about that here. I think it would probably differ in calendar.
[^1]: Just looking for a quick guess here with your experience with this sort of data. I'll probably need to open it myself to poke around anyway :)
> Agreed. What do you have in mind?
>
> - replace `np.random` with `np.ones`?
> - something more spatially varying so you can do interesting interpolation?
Replacing `np.random` with `np.ones` would be sufficient. Interpolation methods that can't get this right would be obvious red flags, so this would be quite useful. More "robust" tests could include overriding data values held by a field, but this is something that can be tackled after we have interpolation of constants sorted out.
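As a sketch of the kind of sanity check this enables (using xarray's built-in `interp` here only as a stand-in for the interpolators under test):

```python
import numpy as np
import xarray as xr

# Constant field on a small rectilinear grid
ds = xr.Dataset(
    {"U": (("y", "x"), np.ones((5, 6)))},
    coords={"y": np.linspace(0.0, 1.0, 5), "x": np.linspace(0.0, 1.0, 6)},
)

# Interpolating a constant field at an arbitrary point must return that constant
sampled = ds["U"].interp(y=0.37, x=0.81)
np.testing.assert_allclose(sampled.values, 1.0)
```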
> Do you know if/how this differs from `cftime`?
When using the `calendar` package in MITgcm, the time dimension will come back as datetimes (I believe that is what was done to create the netCDF data you've shared above). In the example I've provided, the time dimension is just floats. This is quite common for folks who run idealised simulations, where a calendar date for time tracking doesn't really make sense.
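A small illustration of the difference (hypothetical values, not actual MITgcm output):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Idealised run: time is just floats (e.g. model seconds / iteration count)
ds_idealised = xr.Dataset(coords={"time": ("time", np.array([0.0, 3600.0, 7200.0]))})

# Run using the calendar package: time decodes to proper datetimes
ds_calendar = xr.Dataset(coords={"time": ("time", pd.date_range("2000-01-01", periods=3, freq="D"))})

print(ds_idealised["time"].dtype)  # float64
print(ds_calendar["time"].dtype)   # datetime64[ns]
```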
Every time I say something though, I find I'm proven wrong. Might pull up both datasets with xarray and compare :)
If y'all want, I can provide you ROMS fields in any configuration you need (cartesian/spherical/periodic); ROMS has a good set of test cases, so it is easy to whip up runs with arbitrary sizes and configurations.
The time in ROMS follows CF conventions, so, if it is set up sensibly, it will load into an array with proper datetime values.
HOWEVER, in really big runs with lots of files, that can make `xarray.open_dataset()` very slow, and it is often useful to be able to pass arguments to `xarray.open_dataset()` to turn off time processing, specify which variables to join along, etc. This can make a very big difference in the time it takes to open the files - you can see that when reading Mercator Ocean files.
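For example, something along these lines (standard xarray keyword arguments; the file pattern is just a placeholder):

```python
import xarray as xr

# Skipping CF time decoding and minimising coordinate comparison can dramatically
# speed up opening large multi-file model output.
ds = xr.open_mfdataset(
    "ocean_his_*.nc",       # placeholder file pattern
    decode_times=False,     # leave time as raw numbers rather than decoding to datetimes
    data_vars="minimal",    # only concatenate variables that contain the concat dimension
    coords="minimal",
    compat="override",
    parallel=True,          # open files in parallel (requires dask)
)
```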
Jamie
Hi @JamiePringle, thanks for the offer. The intention of this issue is actually just to match the structure of the datasets (rather than have actual datasets).
This task has been put on the back burner for the moment though (we're working on the internal representation in Parcels - we will build the functionality for conversion of these datasets into our internal representation at a later time).