Considerations on interoperability and evolution
What are the plans, if any, to ensure the SSSOM TSV format remains interoperable
- across versions (a file produced with an implementation conforming to version X can be used by an implementation conforming to version X+1),
- across implementations (a file produced by implementation X, e.g. sssom-py, can be used by another implementation Y, e.g. sssom-java)?
Currently, this interoperability goal is explicitly not pursued, as indicated both by the major version number still being zero (which, if we assume the project is using semantic versioning, means that anything can change at any time) and by the explicit warning in the top-level README: “Note that SSSOM is currently under development and subject to change.”
With the SSSOM format being seemingly more and more used in the wild, I believe it is time to consider committing to some form of long-term stability of the format, and/or design ways to make the format evolve while preserving some basic interoperability across versions and implementations.
The following is a random set of ideas that could be explored. Feel free to discuss them, refute them, and add more.
Defining a “core” set of metadata that will never change
We could select a handful of the most important mapping metadata and promote them to a “core set” that would be guaranteed never to change in any future evolution. This is similar to the “minimal spec” idea, though it could probably include slightly more metadata slots than the four mentioned in that ticket.
The principle here is that potential users could be confident that, no matter how the format evolves over time, as long as they only use the “core set” their files would always remain usable by any version of any conforming implementation. For users who need metadata from the “non-core set”, the situation would be the same as it is now for the entire standard: they would need to carefully watch the evolution of the standard to avoid being surprised by a breaking change.
For implementations, this would mean that they should only be strict when parsing the “core” metadata (at least by default; of course, they can choose to allow users to specify a different behaviour). If they encounter a metadata slot they don’t recognise (because it’s an addition from a newer version of the spec), or a slot whose format has changed (e.g. because of a change such as the one envisioned here), they may log a warning but should not fail altogether to parse the file.
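To make the "strict on core, lenient elsewhere" behaviour more concrete, here is a minimal Python sketch. It is not how sssom-py or any other implementation actually works. The contents of CORE_SLOTS and OTHER_KNOWN_SLOTS are assumptions picked for illustration (the actual core set is exactly what would need to be decided), read_mappings is a hypothetical function name, and "strict" is interpreted here as "a core slot must be present and non-empty". Only the unrecognised-slot case is covered, not the changed-format case.

```python
import csv
import io
import logging

logger = logging.getLogger("sssom_sketch")

# Hypothetical core set, for illustration only; the actual list is undecided.
CORE_SLOTS = ("subject_id", "predicate_id", "object_id", "mapping_justification")
# Hypothetical non-core slots that this imaginary implementation knows about.
OTHER_KNOWN_SLOTS = ("predicate_modifier", "confidence")


def read_mappings(text):
    """Parse an SSSOM/TSV string: strict on core slots, lenient on the rest."""
    # Skip the commented metadata block; only the TSV body matters here.
    body = "\n".join(l for l in text.splitlines() if not l.startswith("#"))
    reader = csv.DictReader(io.StringIO(body), delimiter="\t")
    columns = reader.fieldnames or []

    # Strict part: a missing core column is a hard error.
    missing = [s for s in CORE_SLOTS if s not in columns]
    if missing:
        raise ValueError(f"missing core slot(s): {', '.join(missing)}")

    # Lenient part: unknown columns are reported and then ignored.
    known = set(CORE_SLOTS) | set(OTHER_KNOWN_SLOTS)
    for column in columns:
        if column not in known:
            logger.warning("ignoring unrecognised slot %r", column)

    mappings = []
    for number, row in enumerate(reader, start=1):
        for slot in CORE_SLOTS:
            if not row[slot]:
                raise ValueError(f"row {number}: empty core slot {slot}")
        # Keep only the slots this implementation understands.
        mappings.append({k: v for k, v in row.items() if k in known})
    return mappings
```

With this sketch, a file carrying an extra similarity_threshold column is still read (the column is dropped with a warning), whereas a file lacking subject_id is rejected.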
Allow each set to declare its own “must-understand” slots
This can be seen as a variation of the “core set” idea. Here, instead of having a fixed list of “core” metadata slots, the creators of a mapping set could define their own list of the slots they consider as critical.
For example, consider the following set (assuming that the similarity_threshold slot, proposed here, and the mapping_chain_intermediate slot, proposed here, have been added to a later version of the spec):
# must-understand:
# - mapping_chain_intermediate
subject_id predicate_id predicate_modifier object_id mapping_justification similarity_threshold mapping_chain_intermediate
EXA:1234 skos:exactMatch Not EXB:5678 semapv:ManualMappingCuration 0.8 EXC:4321
An implementation trying to read that file would first check the list of the “must-understand” slots for any slot that it does not recognise, and should flatly reject the file if it does contain such a slot.
So, an implementation up-to-date with the latest version of the spec (and thus, which supports both mapping_chain_intermediate and similarity_threshold) would parse the file without any issue. An implementation that for whatever reason does not recognise mapping_chain_intermediate (maybe because it has not been updated to catch up with the latest version of the spec yet) should immediately fail with an error.
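The check described above could look roughly like the following Python sketch. Everything here is hypothetical: the function names, the UnsupportedSlotError class, and the hand-written scan of the metadata block, which only understands a must-understand key followed by "- item" lines. A real implementation would more likely strip the leading "#" characters and hand the block to a proper YAML parser.

```python
class UnsupportedSlotError(ValueError):
    """Raised when a file requires a slot this implementation does not know."""


def must_understand_slots(text):
    """Return the slots listed under 'must-understand' in the metadata block."""
    slots = []
    in_list = False
    for line in text.splitlines():
        if not line.startswith("#"):
            break  # first non-comment line: the metadata block is over
        content = line[1:].strip()
        if content == "must-understand:":
            in_list = True
        elif in_list and content.startswith("- "):
            slots.append(content[2:].strip())
        else:
            in_list = False  # any other metadata line ends the list
    return slots


def check_must_understand(text, known_slots):
    """Reject the file if it requires a slot that is not in known_slots."""
    unknown = [s for s in must_understand_slots(text) if s not in known_slots]
    if unknown:
        raise UnsupportedSlotError(
            "file requires unsupported slot(s): " + ", ".join(unknown)
        )


# The example set from above, with tab-separated columns.
EXAMPLE = (
    "# must-understand:\n"
    "#   - mapping_chain_intermediate\n"
    "subject_id\tpredicate_id\tpredicate_modifier\tobject_id\t"
    "mapping_justification\tsimilarity_threshold\tmapping_chain_intermediate\n"
    "EXA:1234\tskos:exactMatch\tNot\tEXB:5678\t"
    "semapv:ManualMappingCuration\t0.8\tEXC:4321\n"
)
```

Note that in this sketch an implementation that knows mapping_chain_intermediate but not similarity_threshold would still accept EXAMPLE, since similarity_threshold is not declared as must-understand; it would simply be free to ignore that column.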
Adding a slot for the spec version
We could add a simple sssom-version metadata slot at the mapping set level, to indicate the version of the spec this set is conforming to. Ideally that slot would be required to be the very first slot listed in the metadata block, so that a parser could figure out immediately whether the file it is trying to read is using a version it supports.
It would be up to the implementations to decide whether they want to support several versions at the same time or not.
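As an illustration of why requiring the slot to come first is convenient, here is a small Python sketch in which the parser only has to look at the first line of the file. The version strings in SUPPORTED_VERSIONS, the function names, and the policy of returning None for a file without the slot are all assumptions made for the example, not something the spec defines.

```python
# Hypothetical versions supported by this imaginary implementation.
SUPPORTED_VERSIONS = {"1.0", "1.1"}


def declared_version(text):
    """Return the sssom-version from the first metadata line, or None."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("#"):
        return None
    key, sep, value = lines[0][1:].partition(":")
    if not sep or key.strip() != "sssom-version":
        return None
    # Accept both quoted and unquoted values.
    return value.strip().strip("\"'")


def check_version(text):
    """Fail early on an unsupported version; return the declared version."""
    version = declared_version(text)
    if version is not None and version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported SSSOM version: {version}")
    return version
```

The sketch deliberately treats the value as a string. If the metadata block were instead fed to a YAML parser, an unquoted value such as 1.10 would be read as the number 1.1, so the spec would probably want to require quoting or to pick a version syntax that cannot be mistaken for a number.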
Versioning the format in addition to the spec
Regardless of whether we add a versioning slot, it could be useful to introduce a version number for the file format, distinct from the version number of the specification. Not all changes to the specification have an impact on the file format, so tracking the evolution of the spec separately from the evolution of the format would make sense.