airr-standards
clean up ambiguous semantics of data_processing_id
When we introduced data_processing_id, it was meant to allow for multiple data processings to exist side-by-side, and we included statements like this:
It is expected that typical Repertoires might only have a single DataProcessing, in which case repertoire_id and data_processing_id will be semantically equivalent and only the former should be used.
However, this creates ambiguity for a data repository. Does this mean that they should only store data_processing_id if there are two or more? Can it be stored as a null value?
Furthermore, in the airr-schema spec we have this:
If this field is empty than the primary data processing object is assumed.
This causes further difficulty for data repositories, as they have to determine whether an arbitrary query filters on data_processing_id. If it does not, they have to determine which data processing is the primary one and then presumably alter the query or filter to enforce it, that is, exclude results from the non-primary data processings. This is impractical or cumbersome for arbitrary queries on the /rearrangement endpoint.
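To make that burden concrete, here is a rough sketch of what a repository would have to do under the current wording. It assumes the ADC-style filter layout of nested `op`/`content` objects; the function names and the single `primary_id` argument are hypothetical, and a real repository would have to resolve the primary data processing per repertoire, which is harder than what is shown here.

```python
def filter_mentions_field(node, field):
    """Return True if any clause in an ADC-style filter tree references `field`."""
    if not isinstance(node, dict):
        return False
    content = node.get("content")
    if isinstance(content, list):
        # Logical operator such as "and" / "or": recurse into each operand.
        return any(filter_mentions_field(child, field) for child in content)
    if isinstance(content, dict):
        # Leaf comparison such as {"op": "=", "content": {"field": ..., "value": ...}}.
        return content.get("field") == field
    return False


def add_primary_constraint(filters, primary_id):
    """Hypothetical rewrite step: restrict to the primary data processing
    unless the client already filtered on data_processing_id."""
    if filters is not None and filter_mentions_field(filters, "data_processing_id"):
        return filters
    primary = {
        "op": "=",
        "content": {"field": "data_processing_id", "value": primary_id},
    }
    if filters is None:
        return primary
    return {"op": "and", "content": [filters, primary]}
```

Even in this simplified form, every incoming query has to be inspected and possibly rewritten before it reaches the database.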
My suggestion is that we clean up this ambiguity. In particular:
- State that data repositories must always have a data_processing_id on rearrangement records, i.e. null is not allowed. This is enforced on the output from the API. Data repositories can store the data in their database however they want.
- Eliminate the requirements around the assumption of the primary data processing. Thus data_processing_id will be treated just like any other field in an ADC query, through a literal interpretation of the query filter.
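The two points above could look roughly like this on the repository side. This is a toy in-memory sketch, not how any actual repository is implemented: the function names are hypothetical, only the `and`, `or` and `=` operators are handled, and the filter layout is assumed to be the nested `op`/`content` form.

```python
def matches(record, node):
    """Literal interpretation of a filter: no special case for any field."""
    op = node["op"]
    content = node["content"]
    if op == "and":
        return all(matches(record, child) for child in content)
    if op == "or":
        return any(matches(record, child) for child in content)
    if op == "=":
        return record.get(content["field"]) == content["value"]
    raise ValueError(f"operator not supported in this sketch: {op}")


def validate_output(record):
    """Enforce on API output that data_processing_id is never null."""
    if record.get("data_processing_id") is None:
        raise ValueError("rearrangement record is missing data_processing_id")
    return record


def query(records, filters=None):
    """Return every record that matches the filter exactly as written."""
    return [
        validate_output(r)
        for r in records
        if filters is None or matches(r, filters)
    ]
```

A query on repertoire_id alone returns rearrangements from every data processing of that repertoire; nothing is silently narrowed to a primary one.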
This will affect clients of the ADC API, as they will need to provide data_processing_id in their queries if they really want only the data within a specific data processing. That doesn't seem a big burden on them as they are probably doing that anyways, but the upside is that data repositories are relieved of the burden of treating data_processing_id in a special way.
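For illustration, a client that wants a single data processing would state both IDs explicitly in the filter. The helper name and the ID values below are made up, and the body assumes the nested `op`/`content` filter layout; it only builds the request body and does not contact any repository.

```python
import json


def rearrangement_query(repertoire_id, data_processing_id):
    """Build a request body that pins both the repertoire and the data processing."""
    return {
        "filters": {
            "op": "and",
            "content": [
                {"op": "=", "content": {"field": "repertoire_id",
                                        "value": repertoire_id}},
                {"op": "=", "content": {"field": "data_processing_id",
                                        "value": data_processing_id}},
            ],
        }
    }


# Hypothetical identifiers, for illustration only.
body = rearrangement_query("example-repertoire", "example-data-processing")
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent to the repository's /rearrangement endpoint.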
This shouldn't affect tools directly. As before, we don't require tools to acknowledge repertoire_id, data_processing_id and so forth; it's optional. They can if they want, but if not then users just need to make sure that data is appropriately separated.
Philosophical question: Can two data sets that have been processed differently be the same Repertoire?
If Repertoire had a concrete biological meaning then I would say yes, but it doesn't...
@javh I mean, that came up on the call Monday --philosophical or not, I think we're going to have to address it.
@javh @scharch Yeah, I'm fine with going into that but let's do that in a separate ticket, I'd like to keep this issue focused on those specific statements that cause problems for data repositories. Open up a ticket on reimagining Repertoire, with hindsight and the new Clone and Cell objects, I certainly have my own ideas about that.
But the short answer is, depends what you mean by a Repertoire. It's the case right now that you need both repertoire_id and data_processing_id to collect the set of rearrangements to be analyzed together. If you only used repertoire_id then you would (potentially) be mixing rearrangements together that should not be mixed. Does that cause confusion? Probably, yes.
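As a small sketch of what "need both" means in practice, a tool or user could key the rearrangements on the pair of IDs rather than on repertoire_id alone. The function name is hypothetical, and the records are assumed to be plain dicts that carry both fields.

```python
from collections import defaultdict


def group_for_analysis(rearrangements):
    """Group records by (repertoire_id, data_processing_id) so that
    differently processed data is never mixed in one analysis set."""
    groups = defaultdict(list)
    for rearrangement in rearrangements:
        key = (rearrangement["repertoire_id"],
               rearrangement["data_processing_id"])
        groups[key].append(rearrangement)
    return dict(groups)
```

Grouping on repertoire_id alone would merge the sets that this keeps apart.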
This issue would be made moot by PR #453 as the whole point of primary_annotation is to handle multiple data processing, and wouldn't be needed anymore if there was only a single one.
I think we're still expecting it to have a data_processing_id, though, right? #453 only moots the handling of primary_annotation...
See my comment on the pull request... This breaks compatibility on a number of things and I am not sure it makes sense to change DataProcessing until we sort out #441, as that is almost certainly going to entail another change. Does an intermediate change make sense when we know more changes are going to emerge soon? It seems that we can probably live with the current situation in the short term until we sort out #441?
@schristley same with this? v2.1?
Cleared milestone as it's unknown when/if data processing will be re-designed.