
management and conversion of schema changes for users

Open schristley opened this issue 3 years ago • 3 comments

The V1.4 Schema introduces some significant structural schema changes, e.g. single-value fields turned into objects. These are significant enough that programs will get runtime errors if the code assumes one version or the other. I'm wondering, as a standards org, if we can be more user-friendly and proactive in helping users manage these changes? Here are some user issues that I can imagine:

  • User wants a report showing which data/fields in the old schema might require manual intervention/conversion for the new schema.
  • User wants to automatically convert their data from the old schema to the new schema.
  • User might need to utilize tools in the same workflow that support different schema versions.

Right now, the python/R libraries are designed to operate only on the current schema version. Some possible approaches:

  • Design the python/R libraries to support read/write/validate of multiple schema versions.
  • Maintain separate schema files for older versions, e.g. airr-schema-v1.3.1.yaml. We'd need to decide when and how to do this.
  • Add conversion routines in the python/R libraries, with callbacks (i.e. call user-defined functions), so users can add custom code for conversion, e.g., translate values, do ontology lookups, etc.
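To make the callback idea concrete, here is a rough sketch only. Nothing in it is part of the existing airr library: `convert_record`, `label_to_object`, and the field names are hypothetical, and the `{"id", "label"}` object shape is just an assumed example of a single value turned into an object. It also returns a list of fields whose conversion failed, which could feed the "needs manual intervention" report.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical type: a user-defined callback maps an old value to a new value.
Converter = Callable[[Any], Any]


def convert_record(
    record: Dict[str, Any],
    converters: Dict[str, Converter],
) -> Tuple[Dict[str, Any], List[str]]:
    """Apply per-field callbacks; return (converted record, fields needing review)."""
    converted: Dict[str, Any] = {}
    needs_review: List[str] = []
    for field, value in record.items():
        converter = converters.get(field)
        if converter is None:
            # No callback registered: assume the field is unchanged between versions.
            converted[field] = value
            continue
        try:
            converted[field] = converter(value)
        except Exception:
            # Keep the old value and flag the field for manual conversion.
            converted[field] = value
            needs_review.append(field)
    return converted, needs_review


def label_to_object(value: Any) -> Any:
    """Example user callback: wrap a plain string in an assumed {id, label} object.

    A real callback might do an ontology lookup here to fill in the id.
    """
    if value is None:
        return None
    return {"id": None, "label": value}


# Usage with made-up field names:
old_record = {"some_field": "blood", "other_field": 42}
new_record, review = convert_record(old_record, {"some_field": label_to_object})
```

Keeping the callbacks in a plain dictionary keyed by field name means the library itself would not need to know about individual fields; users supply only the conversions they care about.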

Ideas/thoughts?

schristley avatar Feb 06 '22 18:02 schristley

  • Design the python/R libraries to support read/write/validate of multiple schema versions.
  • Maintain separate schema files for older versions, e.g. airr-schema-v1.3.1.yaml. We'd need to decide when and how to do this.
  • Add conversion routines in the python/R libraries, with callbacks (i.e. call user-defined functions), so users can add custom code for conversion, e.g., translate values, do ontology lookups, etc.

Definitely yes for 1 and 3; I think that probably requires 2, as well, but it's a little more opaque to me.

scharch avatar Feb 06 '22 20:02 scharch

For 1-2, I'm concerned about the effort we'll have to dedicate to supporting multiple schema versions in the reference libraries. It'll make the code messier and harder to maintain, depending upon how large the changes are. We did this with changeo, so we could support both the AIRR schema and the old Change-O schema. It works, but it's a huge pain and I kind of regret it, even though it's mostly just column renames.

The docs, schema, and R/python libraries are all tied to the git repo tags, so we could set up a v1.3 maintenance branch for patches to v1.3 everything. That way we continue to support v1.3 as needed, without having to add support for older versions to the current libraries and without having to maintain multiple schema files in master.

3 seems like a good idea to me. If we want people to make the switch, then we should enable them to do so. In retrospect, this is what I wish we had done with changeo and all the R packages. I.e., swapped over to the AIRR Rearrangement schema natively, made a Change-O to AIRR conversion script, and called it good.

javh avatar Feb 07 '22 17:02 javh

For 1-2, I'm concerned about the effort we'll have to dedicate to supporting multiple schema versions in the reference libraries. It'll make the code messier and harder to maintain, depending upon how large the changes are.

I'm hoping very little, but it's not effortless. I don't think the libraries ever reference individual fields like analysis tools do. The current validation code is already general enough; it's been handling schema changes so far without needing to be modified. One exception is better support for allOf (#494).

Here's what I think is needed.

  • functions and the command line tool will need an optional parameter for specifying the schema version
  • cache schemas by version
  • with consistent naming, the schema filename can be constructed from the version
  • maybe an extra function/command line option that lists the available schema versions.
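A minimal sketch of those four points, under stated assumptions: the function names (`schema_filename`, `list_versions`, `load_schema_text`) are hypothetical and not the library's API, and the only thing taken from this thread is the `airr-schema-v<version>.yaml` naming pattern. It returns the raw file text and leaves YAML parsing to the caller, to keep the example stdlib-only.

```python
import re
from functools import lru_cache
from pathlib import Path
from typing import List

# Assumed naming convention, following the airr-schema-v1.3.1.yaml example.
_SCHEMA_NAME = re.compile(r"^airr-schema-v(\d+(?:\.\d+)*)\.yaml$")


def schema_filename(version: str) -> str:
    """Construct the schema filename from a version string such as '1.3.1'."""
    return f"airr-schema-v{version}.yaml"


def list_versions(schema_dir: Path) -> List[str]:
    """List the schema versions found in a directory, oldest first."""
    versions = []
    for path in schema_dir.iterdir():
        match = _SCHEMA_NAME.match(path.name)
        if match:
            versions.append(match.group(1))
    # Sort numerically per component so that 1.10 comes after 1.4.
    return sorted(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


@lru_cache(maxsize=None)
def load_schema_text(schema_dir: Path, version: str) -> str:
    """Read one schema file; results are cached per (directory, version)."""
    path = schema_dir / schema_filename(version)
    if not path.is_file():
        raise FileNotFoundError(f"No schema file for version {version}: {path}")
    return path.read_text(encoding="utf-8")
```

With something like this in place, the optional version parameter on read/write/validate functions would only need to be passed through to the loader, and a missing version fails with a clear error rather than a confusing validation failure.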

The docs, schema, and R/python libraries are all tied to the git repo tags, so we could set up a v1.3 maintenance branch for patches to v1.3 everything. That way we continue to support v1.3 as needed, without having to add support for older versions to the current libraries and without having to maintain multiple schema files in master.

There actually is one already because of some backporting for the ADC API. No tags have been created yet though.

3 seems like a good idea to me. If we want people to make the switch, then we should enable them to do so. In retrospect, this is what I wish we had done with changeo and all the R packages. I.e., swapped over to the AIRR Rearrangement schema natively, made a Change-O to AIRR conversion script, and called it good.

I believe the issue is that pip (for python) only allows one version of the AIRR library to be installed? Can python import both AIRR V1.3 and AIRR V1.4 into the same running program?

schristley avatar Feb 09 '22 22:02 schristley