Considerations on interoperability and evolution

Open gouttegd opened this issue 8 months ago • 17 comments

What are the plans, if any, to ensure the SSSOM TSV format remains interoperable

  • across versions (a file produced with an implementation conforming to version X can be used by an implementation conforming to version X+1),
  • across implementations (a file produced by implementation X, e.g. sssom-py, can be used by another implementation Y, e.g. sssom-java)?

Currently, this interoperability goal is explicitly not pursued, as indicated both by the lower-than-one major version number (which, if we assume the project is using semantic versioning, means that anything can change at any time) and by the explicit warning on the top-level README: “Note that SSSOM is currently under development and subject to change.”

With the SSSOM format seemingly being used more and more in the wild, I believe it is time to consider committing to some form of long-term stability of the format, and/or designing ways to make the format evolve while preserving some basic interoperability across versions and implementations.

The following is a random set of ideas that could be explored. Feel free to discuss them, refute them, and add more.

Defining a “core” set of metadata that will never change

We could select a handful of the most important mapping metadata and promote them to a “core set” that would be guaranteed never to change in any future evolution. This is similar to the “minimal spec” idea, though it could probably include slightly more metadata slots than the four mentioned in that ticket.

The principle here is that potential users could be confident that, no matter how the format evolves over time, as long as they only use the “core set” their files would always remain usable by any version of any conforming implementation. For users who need metadata from the “non-core set”, the situation would be the same as it is now for the entire standard: they would need to carefully watch the evolution of the standard to avoid being surprised by a breaking change.

For implementations, this would mean that they should only be strict when parsing the “core” metadata (at least by default; of course they can choose to allow users to specify a different behaviour). If they encounter a metadata slot they don’t recognise (because it’s an addition from a newer version of the spec), or a slot whose format has changed (e.g. because of a change such as the one envisioned here), they may log a warning but should not fail altogether to parse the file.
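
To make the “strict on core, lenient on the rest” behaviour a bit more concrete, here is a minimal Python sketch. It is not how sssom-py or any other implementation currently works. The contents of CORE_SLOTS and KNOWN_SLOTS are placeholders I made up for illustration (the actual core set would have to be decided by the spec), and parse_row is a hypothetical helper.

```python
import logging

logger = logging.getLogger("sssom_sketch")

# Hypothetical core set; the real list would be fixed by the spec.
CORE_SLOTS = {"subject_id", "predicate_id", "object_id", "mapping_justification"}
# Hypothetical non-core slots that this particular implementation knows about.
KNOWN_SLOTS = CORE_SLOTS | {"predicate_modifier", "confidence"}


def parse_row(header, line):
    """Parse one TSV row: strict on core slots, lenient on everything else."""
    values = line.rstrip("\n").split("\t")
    row = dict(zip(header, values))

    # Strict part: a missing or empty core slot is a hard error.
    missing = sorted(slot for slot in CORE_SLOTS if not row.get(slot))
    if missing:
        raise ValueError(f"missing core slot(s): {missing}")

    # Lenient part: unrecognised slots are logged and skipped, not fatal.
    mapping = {}
    for slot, value in row.items():
        if slot in KNOWN_SLOTS:
            mapping[slot] = value
        else:
            logger.warning("ignoring unrecognised slot %r", slot)
    return mapping
```

With this sketch, a row carrying a similarity_threshold column would still be parsed by an implementation that has never heard of that slot; the column would simply be dropped with a warning.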

Allow each set to declare its own “must-understand” slots

This can be seen as a variation of the “core set” idea. Here, instead of having a fixed list of “core” metadata slots, the creators of a mapping set could define their own list of the slots they consider as critical.

For example, considering the following set (and assuming that the similarity_threshold slot, proposed here, and the mapping_chain_intermediate slot, proposed here, have been added to a later version of the spec):

# must-understand:
#   - mapping_chain_intermediate
subject_id	predicate_id	predicate_modifier	object_id	mapping_justification	similarity_threshold	mapping_chain_intermediate
EXA:1234	skos:exactMatch	Not	EXB:5678	semapv:ManualMappingCuration	0.8	EXC:4321

An implementation trying to read that file would first check the list of the “must-understand” slots for any slot that it does not recognise, and should flatly reject the file if it does contain such a slot.

So, an implementation that is up to date with the latest version of the spec (and which thus supports both mapping_chain_intermediate and similarity_threshold) would parse the file without any issue. An implementation that for whatever reason does not recognise mapping_chain_intermediate (maybe because it has not been updated to catch up with the latest version of the spec yet) should immediately fail with an error.
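
As a rough illustration of the check described above, here is a Python sketch. It assumes the “must-understand” list is written exactly as in the example (one key line followed by “- slot” entries, all inside the “#”-prefixed metadata block) and uses a tiny hand-rolled reader instead of a real YAML parser. SUPPORTED_SLOTS, read_must_understand and check_must_understand are hypothetical names, not part of any existing library.

```python
# Hypothetical list of the slots supported by an implementation that has
# not yet caught up with the latest version of the spec.
SUPPORTED_SLOTS = {
    "subject_id", "predicate_id", "predicate_modifier",
    "object_id", "mapping_justification",
}


def read_must_understand(lines):
    """Collect the 'must-understand' entries from the '#'-prefixed metadata block."""
    slots, in_list = [], False
    for line in lines:
        if not line.startswith("#"):
            break  # first non-comment line: the metadata block is over
        content = line[1:].strip()
        if content == "must-understand:":
            in_list = True
        elif in_list and content.startswith("- "):
            slots.append(content[2:].strip())
        else:
            in_list = False  # any other metadata line ends the list
    return slots


def check_must_understand(lines, supported=SUPPORTED_SLOTS):
    """Reject the file if it declares a must-understand slot we do not support."""
    unknown = [s for s in read_must_understand(lines) if s not in supported]
    if unknown:
        raise ValueError(f"unsupported must-understand slot(s): {unknown}")
```

Fed the example file above, check_must_understand would raise for this outdated implementation, because mapping_chain_intermediate is declared as must-understand. The unknown similarity_threshold column, on the other hand, would not cause a rejection, since it is not in the list.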

Adding a slot for the spec version

We could add a simple sssom-version metadata slot at the mapping set level, to indicate the version of the spec this set is conforming to. Ideally that slot would be required to be the very first slot listed in the metadata block, so that a parser could figure out immediately whether the file it is trying to read is using a version it supports.

It would be up to the implementations to decide whether they want to support several versions at the same time or not.
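
For what it’s worth, if the slot were required to come first, the check could be as simple as the following Python sketch. The sssom-version key is the one proposed above; the version numbers in SUPPORTED_VERSIONS are arbitrary placeholders and both function names are hypothetical. What to do with a file that declares no version at all is left to the caller here, since that would need to be decided separately.

```python
# Placeholder version numbers, for illustration only.
SUPPORTED_VERSIONS = {"1.0", "1.1"}


def read_declared_version(first_line):
    """Return the version declared on the first metadata line, or None."""
    if not first_line.startswith("#"):
        return None
    key, sep, value = first_line[1:].partition(":")
    if sep and key.strip() == "sssom-version":
        # Tolerate optional quotes around the value.
        return value.strip().strip("\"'")
    return None


def check_version(first_line, supported=SUPPORTED_VERSIONS):
    """Fail fast if the file declares a version this implementation does not support."""
    version = read_declared_version(first_line)
    if version is not None and version not in supported:
        raise ValueError(f"unsupported SSSOM version: {version}")
    return version
```

The point of requiring the slot to be first is visible here: the parser only needs to look at a single line before deciding whether to go on.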

Versioning the format in addition to the spec

Regardless of whether we add a versioning slot, it could be useful to introduce a version number for the file format, distinct from the version number of the specification. Not all changes to the specification have an impact on the file format, so tracking the evolution of the spec separately from the evolution of the format would make sense.

gouttegd, Oct 24 '23 14:10