Common agreement on loading CF non-compliant NetCDF files

Open trexfeathers opened this issue 2 years ago • 4 comments

Iris needs a public statement on how it handles NetCDF files that deviate from the CF conventions. This will bring several benefits:

  • More certainty when discussing if/how Iris should load a particular file.
  • Clearer direction when developing the codebase.
  • Set user expectations.

Writing this statement will involve making some difficult decisions. A working group is tackling this now: @tkknight, @bjlittle, @lbdreyer, @pp-mo, @trexfeathers, @stephenworsley, @ESadek-MO, @scottrobinson02, @HGWright

Factors at play

  • More CF compliance means smoother collaboration between institutions, and Iris can play a part in raising awareness.
  • CF evolves over time, so may develop 'opinions' on things that previously didn't matter and invalidate older files.
  • The available tooling can make it difficult to address non-compliances in a file.
  • UX - being strict/verbose about CF compliance makes the user experience more awkward.
  • Iris has a place in the scientific Python community: people choose Iris, Xarray, raw netCDF4 or something else for different purposes, and CF handling plays a part in that choice.
  • Continuing to work in the face of CF non-compliances could need more defensive code.

Items affected

(please edit if you know of others)

  • #5119
  • #5126
  • #5068
  • #5067
  • #5003
  • #4495
  • #1801
  • #5171
  • #4453
  • #5257

### Tasks
- [ ] https://github.com/SciTools/iris/issues/5068
- [ ] https://github.com/SciTools/iris/issues/5119

trexfeathers · Feb 20 '23 15:02

Summary from working group conversations

2023-02-02, 2023-02-14, 2023-03-22

Note that this issue is not intended as a debate, which is why it is not posted as a discussion. The conversations below took place in real time, with a group deliberately sized to aid decision making.

Outcome - our ideal implementation

When loading NetCDF files, Iris will load all CF-compliant elements. A container of non-compliant variables and attributes will be attached to the Cube(s).

Encourage users:

If this causes you problems, please reach out to us to see if we can collaborate on a solution.

Implementation considerations

  • How to contain things that can't be represented properly?
  • Associate things with Cubes or isolated in own list?
  • Activate behaviour with a FUTURE flag?
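
To make the first two questions above concrete, here is a minimal sketch of both placement options. It does not use Iris: `ToyCube`, `NonCompliantItem` and `load_toy` are hypothetical names invented for illustration, and the "compliance check" (a variable must have a `units` attribute) is a toy stand-in, not a real CF rule.

```python
from dataclasses import dataclass, field


@dataclass
class NonCompliantItem:
    """One file element that could not be interpreted (hypothetical)."""
    name: str
    reason: str
    attributes: dict = field(default_factory=dict)


@dataclass
class ToyCube:
    """Stand-in for a Cube; NOT the real iris.cube.Cube."""
    name: str
    units: str
    non_compliant: list = field(default_factory=list)


def load_toy(variables, attach_to_cubes=True):
    """Split raw variables into cubes and non-compliant items.

    `variables` maps a variable name to its attribute dict. The check
    used here (presence of `units`) is a toy placeholder for real
    CF checks.
    """
    cubes, rejected = [], []
    for name, attrs in variables.items():
        if "units" in attrs:
            cubes.append(ToyCube(name=name, units=attrs["units"]))
        else:
            rejected.append(
                NonCompliantItem(name, "missing units", dict(attrs))
            )
    if attach_to_cubes:
        # Option A: every cube carries the container of rejected items.
        for cube in cubes:
            cube.non_compliant = list(rejected)
        return cubes, []
    # Option B: rejected items are returned as an isolated list.
    return cubes, rejected


raw = {
    "air_temperature": {"units": "K"},
    "mystery_field": {"comment": "no units recorded"},
}
cubes, orphans = load_toy(raw, attach_to_cubes=True)
```

Option A keeps everything reachable from the Cube but duplicates the container across Cubes from the same file; option B avoids duplication but changes what a load call returns. The `attach_to_cubes` argument only exists to show both side by side.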

Working group summary comments

  • @trexfeathers: embrace imperfection, skipping non-compliances sounds good if warnings work.
  • @stephenworsley: CF compliance is a good aim, but can't always be expected.
  • @pp-mo: CF offers optional ways of doing things, Iris ought to do its best, but not insist. Discourage 'bad CF'.
  • @bjlittle: KISS. Make users' lives simple, don't be awkward.
  • @lbdreyer: we'll always break someone's workflow. Need a plan to help those who are left behind.
  • @scottrobinson02: spirit of compromise. Accept that going in.
  • @tkknight: KISS. Informative messages when things don't work.
  • @HGWright: if we can do something we should do something. Don't throw toys from pram. Make our actions clear.
  • @ESadek-MO: no easy solution, communicate well, focus on warnings.

Discussion topics

Encouraging compliance in the community

  • We know examples where Iris' strictness has resulted in more compliant - more interoperable - files.
  • CF is a convention, not a standard.
  • CF is the only available convention and is therefore used by anyone looking for help making files interoperable.
  • Iris' scope is wider than CF, and Iris doesn't implement all of CF. Need to avoid inventing our own rules.
  • CF's longevity is relevant.

Files changing from acceptable to unacceptable

  • While CF is intended to be backwards compatible, checks (within Iris, cf-checker, whatever) are not a complete implementation and may evolve over time, invalidating previously acceptable files.

Ease of massaging files to be compliant

  • Always going to be somewhat difficult.
  • If Iris can't cope with non-CF, then users are forced onto another tool.
    • Could edit the file directly using ncedit or NetCDF4, but this can be challenging, and editing a copy may be unrealistic.
    • All the rich tools (Iris, Xarray, cf-python) have their own opinions.
    • ncdata has the potential to make this much easier.
  • Should Iris include a non-CF layer, lower than a Cube, to help with fixing?

User experience (UX)

  • Must not be underestimated.
  • Undesirable to flatly refuse to load.
  • Need clarity on what Iris expects.
  • Need user education.
  • Warnings are an opportunity to encourage compliance and help, without 'being awkward'.
    • Really important to not ruin UX with even more warnings.
    • Classify warnings? Allowing users granularity for what they care about / ignore?
  • CF brings some inevitable complexity, some user effort required.
  • Compromises are necessary.
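
The idea of classifying warnings can be sketched with Python's standard `warnings` module: a small class hierarchy lets users filter one category while keeping another. The class names below are hypothetical, chosen for illustration only, and are not Iris' actual warning classes.

```python
import warnings


class NonCompliantWarning(UserWarning):
    """Hypothetical base category for all CF non-compliance warnings."""


class SkippedVariableWarning(NonCompliantWarning):
    """Hypothetical: a whole variable was skipped during loading."""


class IgnoredAttributeWarning(NonCompliantWarning):
    """Hypothetical: a single attribute was ignored during loading."""


def emit_examples():
    """Raise one warning of each category, as a loader might."""
    warnings.warn("skipped variable 'mystery_field'", SkippedVariableWarning)
    warnings.warn("ignored attribute 'bad_attr'", IgnoredAttributeWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Silence only the attribute-level category; keep variable-level ones.
    warnings.simplefilter("ignore", category=IgnoredAttributeWarning)
    emit_examples()

seen = [w.category.__name__ for w in caught]
```

Because both categories share a base class, a user who wants silence can filter `NonCompliantWarning` once, while a user who cares only about skipped variables can filter more narrowly.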

Iris' place in the world

  • Interoperability allows using other, more tolerant tools.
  • Learning/adopting other tools is nevertheless not as good as getting everything from one place.
  • We should aim to avoid duplication within the geoscience community.

Ease of software development

  • Defensive code takes extra effort.
  • Iris could be written to work with things it doesn't explicitly understand.
  • API changes could make things easier:
    • Interchange between Cube and _DimensionalMetadata.
    • Easier construction of Cubes from scratch.
  • Might be easier to include user-level fixing tools in Iris, rather than making Iris cope better.

Preferred approaches

Determined via voting.

  1. Iris loads only the CF-compliant parts of a file, skipping non-compliant parts (maybe raising a warning?).
  2. Iris allows the user to configure how it will interpret a malformed file.

trexfeathers · Jul 20 '23 14:07

Oooh just discovered this issue via DragonTaming board @trexfeathers.

Sounds like you've got a fair bit of input from working group already; please shout though if useful to have more, as this is a particularly painful area for space weather - and we've got a good amount of requirements (ionosphere and lower) in the iris-o-sphere of traditional geographic lat/lon coords!

More context on why CF non-compliance is an issue for space weather

Highly interested: space weather is not represented in CF conventions, so data wrangling is a key issue for us.

There have been a few times where I've consciously decided not to go with iris due to anticipating "ugh, lots of pain handling I/O at boundaries due to data being inherently non-CF-compliant".

In retrospect, often this decision was bad:

  • I've ended up writing (and then having to support!) custom code - e.g. pseudo-geo-aware dataclasses & methods for ionospheric data - which ends up being a poorer version of iris.
  • I'd have been better served going for the real deal, and biting the boundaries-pain bullet.

Self-interestedly v happy to give more input if useful - help you help me!

edmundhenley-mo · Jul 01 '24 11:07