
Supporting various circulation models and Parcels' internal representation of data

Open VeckoTheGecko opened this issue 10 months ago • 8 comments

This issue is a mix of discussion on internal representation of data, and tasks required for supporting reading data from various circulation models.

Internal representation of data

We've decided to work internally with Xarray datasets; an important matter to discuss now is what these datasets actually look like internally[^3] (i.e., dimension ordering, how depth is defined, ...). (1) Do we work with the datasets in a form close to the model output (i.e., close to the raw NetCDF files)? Or (2) do we work with datasets that match a certain internal representation defined by us?

@erikvansebille I recall from our group meetings that you were leaning towards (1), saying something along the lines of "allowing users to write interpolator methods while using the knowledge about their model" (please, if I don't fully understand your viewpoint or am missing something, comment below).

I am personally leaning towards (2) for the following reasons:

  • It's safer
    • At the point of initialisation we can throw an informative error message if the dataset doesn't match our internal representation. This is easier to debug than runtime errors from failed index searching.
  • It makes our code simpler and more performant
    • All our indexing methods can make assumptions according to our internal model (e.g., that the depth array is increasing)
    • We need to do less checking before we do something
  • Makes testing easier as from FieldSet all the way through, we only need to test things in our internal data model (e.g., no need to test some edge case introduced only when model X is running pset.execute())
  • It's all lazy anyway
    • Any dataset transformation done to get to this "internal state" would be done lazily via Dask
  • I feel that we can define an internal representation within parcels that encompasses all models that we want to support. Admittedly, I'm not 100% sure on whether my feeling here is well informed - I'm not intimately familiar with the model outputs we work with.
  • We already have some assumptions about internal representation with assuming that the data is ordered [tdim, zdim, ydim, xdim]
  • Allows interpolation methods to be portable between Fields (and allows us to ship them in Parcels)

A downside of (2) is (as you point out) that those writing interpolation methods need to understand how data is handled internally in Parcels as opposed to their model - that is something we would need good documentation for to support those writing interpolation methods.
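To make the "safer" argument for option (2) concrete, here is a minimal sketch of what an initialisation-time check could look like. Plain dicts stand in for xarray datasets, and all names (`validate_internal`, `EXPECTED_DIMS`) are hypothetical, not part of the Parcels API:

```python
# Sketch of option (2): validate a dataset against a fixed internal
# representation at Field initialisation time, so problems surface as
# informative errors rather than as failed index searches at runtime.
# Plain dicts stand in for xarray datasets; all names are hypothetical.

EXPECTED_DIMS = ("time", "depth", "lat", "lon")  # [tdim, zdim, ydim, xdim]

def validate_internal(ds: dict) -> None:
    dims = tuple(ds["dims"])
    if dims != EXPECTED_DIMS:
        raise ValueError(
            f"expected dimension order {EXPECTED_DIMS}, got {dims}; "
            "transpose the dataset before constructing the Field"
        )
    depth = ds["depth"]
    if any(a >= b for a, b in zip(depth, depth[1:])):
        raise ValueError("depth coordinate must be strictly increasing")

# A conforming dataset passes silently...
validate_internal({"dims": ("time", "depth", "lat", "lon"),
                   "depth": [0.0, 10.0, 50.0]})

# ...while a transposed one fails at initialisation, not mid-execution.
try:
    validate_internal({"dims": ("time", "lat", "lon", "depth"),
                       "depth": [0.0, 10.0]})
except ValueError as err:
    print(err)
```

The point of the sketch is the error-locality: every indexing method downstream can then assume the invariants instead of re-checking them.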

Tasks required for supporting reading data from various circulation models

Sub-issues:

  • [ ] #2004
  • [ ] (2) Decide on an internal representation of data within parcels (or decide on how we are going to define search methods in a flexible way to work with different datasets)
    • [DOC] Document this internal representation in the v4 docs (useful for those writing interpolators etc.)
    • [^2] Define helper functions (in module parcels._datasets.(un)structured.coerce) to transform datasets from the original representation (i.e., circulation_model.py) to the Parcels internal representation. These coercion functions could either serve as documentation for users (to see which transformations they need to apply to bring their data in line with Parcels), or be used in public convenience methods such as FieldSet.from_...() which handle this automatically.
  • [ ] (3) What do indices correspond to? [Joe to write issue]
    • Is it the f-points, or the t-points?
    • How are VectorField indices handled?
  • [ ] (4) Define further helper functions
    • e.g., Field.from_cf_compliant(ds) which takes in a dataset with CF-compliant metadata and does all the transformations needed to bring it into parcels internal assumptions
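As one possible shape for such a helper, here is a sketch of renaming variables via their CF `standard_name` metadata. Dicts of attributes stand in for xarray variables; the mapping and the function name `from_cf_compliant` follow the proposal above but are otherwise hypothetical:

```python
# Hypothetical sketch of a from_cf_compliant-style helper: use CF
# standard_name metadata to rename variables to Parcels-internal names.
# Dicts stand in for xarray datasets; the mapping below is illustrative.

CF_TO_INTERNAL = {
    "eastward_sea_water_velocity": "U",
    "northward_sea_water_velocity": "V",
    "upward_sea_water_velocity": "W",
}

def from_cf_compliant(ds: dict) -> dict:
    """Return a copy of `ds` with variables renamed via their standard_name."""
    out = {}
    for name, var in ds.items():
        std = var.get("standard_name")
        out[CF_TO_INTERNAL.get(std, name)] = var
    return out

ds = {
    "uo": {"standard_name": "eastward_sea_water_velocity", "data": [...]},
    "vo": {"standard_name": "northward_sea_water_velocity", "data": [...]},
}
print(sorted(from_cf_compliant(ds)))  # → ['U', 'V']
```

With real data, cf_xarray's accessors could replace the hand-rolled lookup table.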

[^3]: By here "internally" I mean at the point where the dataset is passed to the Field initialiser, and is stored on the data attribute (or similarly, passed to the Grid initialiser). This is the structure that the rest of Parcels can safely assume. From a user POV they don't necessarily need to do the data transformations themselves, this can be bridged using classmethods.

VeckoTheGecko avatar May 08 '25 14:05 VeckoTheGecko

Thanks for starting this discussion, @VeckoTheGecko! I agree this is an important choice to make. While I see that option 2) is indeed much safer and easier to maintain, I fear it will confuse users a lot. Most users are intimately knowledgeable about their model grid, and having them think about converting it to a Parcels grid may be a big step. A good example is the CROCO sigma-grids.

Can't we do something between options 1) and 2)? Have a robust set of interpolation methods for grids that are "Parcels-compliant" (to be defined), but also allow users to write their own interpolation schemes for non-compliant grids. We could raise warnings if grids are non-compliant, but we wouldn't throw an error.

That way we can keep the versatility that is one of the strengths of Parcels!

erikvansebille avatar May 09 '25 06:05 erikvansebille

I think (1) is one end of a spectrum, and (2) represents the rest of that spectrum. I think there would need to be some transformation done as data is read into Parcels (i.e., expanding fields to 4D arrays of (tdim, zdim, ydim, xdim), transposing as required, converting measurements to m/s (as opposed to the cm/s that at least one model uses), and attaching metadata).
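The light-touch transformations mentioned above could look roughly like this. Plain Python stands in for what would be lazy Dask/xarray operations on real data, and the conversion table and function names are illustrative assumptions, not Parcels internals:

```python
# Sketch of minimal read-in transformations: reorder dimensions to the
# internal (tdim, zdim, ydim, xdim) order and convert velocities to m/s.
# Plain lists/tuples stand in for xarray objects; on real data these
# steps would be lazy Dask operations.

UNIT_TO_MS = {"m/s": 1.0, "cm/s": 0.01}  # illustrative conversion table

def to_internal_units(values, units: str):
    """Scale velocity values into m/s according to their declared units."""
    factor = UNIT_TO_MS[units]
    return [v * factor for v in values]

def reorder_dims(dims, target=("time", "depth", "lat", "lon")):
    """Return the axis permutation that maps `dims` onto the internal order."""
    return tuple(dims.index(d) for d in target)

print(to_internal_units([100.0, 250.0], "cm/s"))      # → [1.0, 2.5]
print(reorder_dims(("time", "lat", "lon", "depth")))  # → (0, 3, 1, 2)
```

The permutation returned by `reorder_dims` is exactly what one would pass to a transpose on the underlying array.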

Where we land on the spectrum of (2) (ie., the extent to which we change things from the original netcdf representation) is up to us and I think will be informed by #2004 - I agree that it would be good to land in the middle and not change too much, but I do think we'll need to change some things to make things easier for us to work with the data internally.

VeckoTheGecko avatar May 09 '25 14:05 VeckoTheGecko

      • e.g., Field.from_cf_compliant(ds) which takes in a dataset with CF-compliant metadata and does all the transformations needed to bring it into parcels internal assumptions

Also might be of interest https://github.com/bcdev/xrlint , as well as cf_xarray

VeckoTheGecko avatar Jun 24 '25 15:06 VeckoTheGecko

TLDR: Let's stick close to the v3 API (from_model pointing at a mix of datasets)


When defining a hydrodynamic FieldSet we need to have the following information:

  1. the axes coordinates (lons and lats of the f-points)
  2. the Field (array) data, as well as which axes are in each Field
  3. positioning of each of the axes in the Field with respect to the axes coordinates

Internally within Parcels this information is all well defined via Field and Grid objects (leveraging xgcm).

We have been exploring what it looks like in version 4 for users to construct FieldSets - bridging the gap between their input models and our internal representation. Without these helper methods, there is the need for quite some boilerplate.

As I see it, we have 2 options:

  • a. build an API close to the model output (this was done in version 3 of Parcels - where users used the type of model (from_nemo etc.) alongside filenames, variables and dimensions to provide information (1), (2), (3) above).
  • b. build an API around an intermediate representation of data created by the user (e.g., users load and combine xarray datasets and pass them through to Parcels) - this is an avenue that we have been discussing in the core team.

Difficulty with b

The difficulty with option (b) is that users still need to provide information surrounding (1), (2), (3) to Parcels alongside their combined xarray dataset. This could either be done by:

  • attaching information to the xarray dataset (attribute metadata, or specific naming of variables) to encode this information
    • This requires us to use a convention like the SGRID convention, or create our own convention (i.e., assuming variables follow certain naming, etc.)
  • passing along keyword arguments
    • This requires us to design a new API (perhaps new kwargs) - similar to what was done in v3 of Parcels

In this instance, and in my opinion, both enforcing our own convention and designing a different API in Parcels v4 would be an unnecessary burden, both for users transitioning from v3 of Parcels and for users familiar with their models who would need to convert to an intermediate representation.

If we are to ingest from an intermediate representation, then let's at least do it from datasets aligned with the SGRID convention. This way we are more aligned with the community, and if users already have a dataset complying with SGRID conventions they can load it easily. SGRID parsing is now supported in xgcm v0.9.0, released last month[^1]; our vendored code does not have this functionality, so we should implement #2193.
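In the SGRID convention, the grid information travels inside the dataset itself via a dummy variable with the attribute `cf_role = "grid_topology"`. A minimal sketch of locating that variable (dicts of attributes standing in for xarray variables; the helper name is hypothetical):

```python
# Sketch of locating the SGRID grid-topology variable by its cf_role
# attribute. This is how information (1)-(3) can travel inside the
# dataset itself rather than through extra keyword arguments.
# Dicts of attribute mappings stand in for xarray variables.

def find_grid_topology(ds_vars: dict) -> str:
    """Return the name of the variable carrying cf_role='grid_topology'."""
    for name, attrs in ds_vars.items():
        if attrs.get("cf_role") == "grid_topology":
            return name
    raise ValueError("no SGRID grid_topology variable found in dataset")

ds_vars = {
    "u": {"units": "m s-1"},
    "grid": {"cf_role": "grid_topology", "topology_dimension": 2},
}
print(find_grid_topology(ds_vars))  # → grid
```

With xgcm v0.9.0's SGRID support, this discovery (and the parsing of `node_dimensions`-style attributes) would be handled by the library rather than by hand.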


A way forward

I don't see a significantly better API for ingesting FieldSets than what was implemented in v3 of Parcels (at least, not an API that is better enough to justify the cost to users in terms of migration[^2]).

I think that we should do the following:

  • Implement from_nemo, from_croco, from_mitgcm, from_pop, from_mom along the lines of the v3 API.
    • Under the hood, the files will be opened with the correct Xarray configurations, coordinate variables will be renamed appropriately, and the data will be combined into a single Xarray dataset; then the FieldSet will be created. This will be abstracted away from the user, and will emit log messages so that the user has more insight into what's happening under the hood. Users can also provide xarray datasets instead of file names (so they can load data from zarr/other backends if they need to).
    • This means that users coming from v3 will have no migration cost for ingesting fieldsets.
  • Implement from_conventions(ds) which takes an xarray dataset and assumes SGRID and CF conventions.

Internally in Parcels, once from_conventions(ds) is defined, it can be the primary execution path through which the other helper functions flow.
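The proposed execution path can be sketched as a thin dispatch: each model-specific constructor does only the coercion needed to reach a conventions-compliant dataset, then delegates to the single shared entry point. All names and the dict-based stand-ins below are hypothetical placeholders for the v4 API:

```python
# Sketch of the proposed v4 execution path: model-specific constructors
# coerce their input, then delegate to one shared from_conventions path.
# Dicts stand in for xarray datasets / FieldSet objects.

def from_conventions(ds: dict) -> dict:
    # Single primary path: assumes SGRID/CF-compliant metadata.
    return {"fieldset_from": ds["title"]}

def from_nemo(ds: dict) -> dict:
    # Model-specific shim: rename NEMO variables, attach metadata, ...
    coerced = dict(ds, title=f"NEMO: {ds['title']}")
    return from_conventions(coerced)

print(from_nemo({"title": "ORCA0083"}))  # → {'fieldset_from': 'NEMO: ORCA0083'}
```

The design benefit is that validation, logging, and grid construction live in one place (`from_conventions`), while the per-model shims stay small and auditable.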

I hope that this post hasn't been overly verbose.

[^1]: Thanks a lot @jbusecke for your efforts with the release there! 🎉

[^2]: @erikvansebille, @michaeldenes are there any particular pain points that you have encountered with the version 3 API for ingesting fieldsets?

VeckoTheGecko avatar Sep 16 '25 12:09 VeckoTheGecko

I've stumbled across one oddity that may be relevant to this discussion. GLORYS12 data is available via Copernicus. Under the hood this is NEMO data that has been interpolated such that all variables are cell-center registered (even though NEMO is natively C-grid). Alternatively, the model data on the ORCA grid is available via pydap (and consequently xarray).

I haven't looked in detail yet at how the metadata differs between the Copernicus data and the pydap/thredds endpoint, but to me it underscores (a little bit) that end users ought to be aware of the data they're feeding into Parcels. In this case, both datasets represent the "same" parent model run, but when accessed via Copernicus the data is all cell-center registered, while the other is on the native curvilinear C-grid. The hitch I see here is that someone could attempt a from_nemo with either of these and have a bad time if we're assuming from_nemo means C-grid registered fields.

fluidnumerics-joe avatar Sep 17 '25 15:09 fluidnumerics-joe

@fluidnumerics-joe can you provide outputs of ncdump/Dataset.info() so that we can look at the metadata?

I think this is related to the discussion I had with @michaeldenes yesterday about Copernicus providing derivative datasets where they reinterpolate NEMO data onto an A grid - and users being unaware of this.

VeckoTheGecko avatar Sep 18 '25 05:09 VeckoTheGecko

@fluidnumerics-joe I think updating the documentation to emphasise that from_nemo (or from_mitgcm etc.) is only valid for native NEMO data would be enough. The Copernicus data, for example, is derived (or post-processed) from a NEMO model, but it's definitely not "NEMO model data". It would be a slippery slope if we start to natively support anything but the native model data.

michaeldenes avatar Sep 18 '25 08:09 michaeldenes

Here's a list of hydrodynamic models that we would want to support in v4:

  • [x] FieldSet.from_copernicusmarine()
  • [ ] FieldSet.from_fesom2()
  • [ ] FieldSet.from_nemo()
  • [ ] FieldSet.from_croco()
  • [ ] FieldSet.from_hycom()

erikvansebille avatar Nov 17 '25 15:11 erikvansebille