airr-standards icon indicating copy to clipboard operation
airr-standards copied to clipboard

Genotype and MHCGenotype are too deeply nested

Open javh opened this issue 2 years ago • 21 comments
trafficstars

Both Genotype and MHCGenotype are nested more deeply than typical for the schema. MHCGenotype.mhc_alleles and and Genotype.*_alleles could be pulled out into top level objects with refs in Genotype and MHCGenotype.

javh avatar Jan 23 '23 19:01 javh

Both Genotype and MHCGenotype are nested more deeply than typical for the schema.

I had the same sense too, it's little things like for example instead of this in Subject

        genotype:
            type: object
            nullable: true
            description: Genotype for this subject, if known
            properties:
                receptor_genotype_set:
                    nullable: true
                    $ref: '#/GenotypeSet'
                    description: Immune receptor genotype set for this subject.
                mhc_genotype_set:
                    nullable: true
                    $ref: '#/MHCGenotypeSet'
                    description: MHC genotype set for this subject.

just name them two different fields in Subject

                receptor_genotype_set:
                    nullable: true
                    $ref: '#/GenotypeSet'
                    description: Immune receptor genotype set for this subject.
                mhc_genotype_set:
                    nullable: true
                    $ref: '#/MHCGenotypeSet'
                    description: MHC genotype set for this subject.

schristley avatar Jan 23 '23 21:01 schristley

Making receptor_genotype_set and mhc_genotype_set independent properties of Subject makes sense to me as these are independent objects.

I am less sure of the logic behind splitting documented_alleles, undocumented_alleles and deleted_genes into high-level objects., These are inherent related components of the genotype. Is there a need to reduce the depth of nesting to a specific level? If so could you articulate the reason please, so that we can consider options.

williamdlees avatar Jan 24 '23 09:01 williamdlees

The reason is mostly stylistic. In the rest of the schema, objects live in other objects via $ref: not by nesting the definition. They would still be inherent related components if their definition was at the top level. But, it should be easier to validate if we use the same style throughout.

Aside from that, we might be able to apply some abstraction to documented_alleles, undocumented_alleles, and deleted_genes. Not sure on that. We'd have to discuss.

javh avatar Jan 24 '23 18:01 javh

Thanks Jason. As with the other issues I've commented on this morning, this feels like a restriction on validation that we have to live with because the amount of resource we can put in to the validation code is limited. I think it will become problematic as, inevitably, the schema objects become more complex. Perhaps again we should log it as restriction that we would like to overcome in the future, and for the time being I can change the Germline objects to live with it.

williamdlees avatar Jan 25 '23 10:01 williamdlees

Thanks, @williamdlees. I think for this particular issue, we can just focus on the style fix and maybe the allele abstraction question. The validation should follow from that. If the style is right, the validation should be fine, but, if not, then we'll need to fix the validation.

javh avatar Jan 25 '23 17:01 javh

From the call:

  • De-nesting Germline schema: De-nesting shouldn't break anything, so we can include this is v1.4.2.
  • documented_alleles, undocumented_alleles, deleted_genes are sufficiently distinct to leave them as separate objects when de-nesting.
  • undocumented_alleles.allele_name should be renamed to label, but we'll need to check if this breaks backwards compatibility in the ADC and therefore needs to wait until v2.0 (likely yes).
  • Subject.genotype: Fine to de-nest by moving receptor_genotype_set and mhc_genotype_set up a level. However, this will break backwards compatibility so it needs to wait until v2.0. Additional issues to address in v2.0: (1) should these be arrays and (2) should they be in Repertoire instead of Subject (as they are downstream from DataProcessing).

javh avatar Mar 20 '23 20:03 javh

  • undocumented_alleles.allele_name should be renamed to label, but we'll need to check if this breaks backwards compatibility in the ADC and therefore needs to wait until v2.0 (likely yes).

@javh Can you deprecate and add label in with v1.4.2? That way I can start using label now.

schristley avatar Apr 15 '23 19:04 schristley

Deprecation and changing a label breaks backwards compatibility so it shouldn't be part of v1.4.2, no? Any field name/type changes shouldn't be a part of a minor release...

bcorrie avatar Apr 17 '23 18:04 bcorrie

Deprecation and changing a label breaks backwards compatibility so it shouldn't be part of v1.4.2, no? Any field name/type changes shouldn't be a part of a minor release...

Right, yeah minor release should just be software and minor fixes. Could be a v1.5.x then. I wasn't sure if saying v2.0 because the change required a new version of MiAIRR, or there's just an expectation to get v2.0 released before a v1.5...

schristley avatar Apr 17 '23 18:04 schristley

  • De-nesting Germline schema: De-nesting shouldn't break anything, so we can include this is v1.4.2.

As long as the "path" to the object doesn't change - which de-nesting should not do - then things should be fine.

So subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class should not change after the de-nesting.

The above is how one queries for this field, so if the path changes then we are breaking compatibility...

A query using the API looks like this:

{
  "filters": {
    "op":"=",
    "content": {
      "field":"subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class",
      "value":"MHC-I"
    }
  }
}

So this query should work after any changes:

curl -d '{"filters":{"op":"=", "content": { "field":"subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class", "value":"MHC-I"}}}' https://t1d-1.ireceptor.org/airr/v1/repertoire

bcorrie avatar Apr 17 '23 18:04 bcorrie

From the call:

  • Create a new issue for the label and receptor_genotype_* field renames for milestone v1.5.0.

javh avatar Apr 17 '23 18:04 javh

  • De-nesting Germline schema: De-nesting shouldn't break anything, so we can include this is v1.4.2.

As long as the "path" to the object doesn't change - which de-nesting should not do - then things should be fine.

The term de-nesting seems to be used in two contexts. For Subject.genotype, the de-nesting is changing the "path"

So subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class --> subject.mhc_genotype_set.mhc_genotype_list.mhc_class and thus queries would change.

The other de-nesting is making stylistic changes to the schema, e.g. #669 and these shouldn't affect the "path"

schristley avatar Apr 17 '23 19:04 schristley

So where do we stand on this for v2.0?

  • Subject.genotype: Fine to de-nest by moving receptor_genotype_set and mhc_genotype_set up a level. However, this will break backwards compatibility so it needs to wait until v2.0. Additional issues to address in v2.0: (1) should these be arrays and (2) should they be in Repertoire instead of Subject (as they are downstream from DataProcessing).

Personally, I am not sure I understand why we want to move these "up a level". Logically they are both genotype of a subject, so why do we not want to have them represented that way? This is a minor detail, and I don't really care that much, but it seems odd that this is viewed as "better". For v2.0 I would lean toward keeping the subject.genotype but I am not highly invested either way.

bcorrie avatar Feb 08 '24 20:02 bcorrie

  • Subject.genotype: Fine to de-nest by moving receptor_genotype_set and mhc_genotype_set up a level. However, this will break backwards compatibility so it needs to wait until v2.0. Additional issues to address in v2.0: (1) should these be arrays and (2) should they be in Repertoire instead of Subject (as they are downstream from DataProcessing).

It doesn't make sense to me to have subject genotype logically stored only at the repertoire level since biologically they are related to the subject. We feel strongly that we need the ability to store both MHC and Receptor genotype at the subject level if they are known. That is what these fields are for. If you don't know subject level genotype you don't have to store anything!

I admit our experience is more with MHC Genotype for the studies that are looking at MHC/peptide binding, but it is always subject specific so it would not make any sense to group it with a repertoire.

It is my understanding that in the ideal world, we would know the Subject receptor_genotype_set and that is what would be used for sequence annotation to get the most accurate annotation of a Repertoire for that subject. The experimenter can store their best estimate of what the receptor_genotype_set is for this subject if it has been determined. To me that is the intent of these subject MHC and Receptor Genotype fields.

I am pretty sure I have suggested this before, but if we need the ability to store a receptor_genotype_set as inferred from a Repertoire with that Repertoire, and that can not be considered the Subject receptor_genotype_set then I would suggest we need another field. Maybe another data collection as @schristley suggests, It is this type of receptor_genotype_set that is downstream of DataProcessing (inference_process = repertoire_sequencing). I am not saying this isn't needed...

But that should not detract from the fact that if the experimenter has determined a Subject receptor_genotype_set (e.g. using inference_process = genomic_seqeuncing) they should have the ability to store that with the Subject.

bcorrie avatar Feb 08 '24 21:02 bcorrie

So subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class --> subject.mhc_genotype_set.mhc_genotype_list.mhc_class and thus queries would change.

Up one level. So still a property of Subject as per @schristley's comment above

scharch avatar Feb 08 '24 21:02 scharch

So where do we stand on this for v2.0?

  • Subject.genotype: Fine to de-nest by moving receptor_genotype_set and mhc_genotype_set up a level. However, this will break backwards compatibility so it needs to wait until v2.0. Additional issues to address in v2.0: (1) should these be arrays and (2) should they be in Repertoire instead of Subject (as they are downstream from DataProcessing).

Personally, I am not sure I understand why we want to move these "up a level". Logically they are both genotype of a subject, so why do we not want to have them represented that way? This is a minor detail, and I don't really care that much, but it seems odd that this is viewed as "better". For v2.0 I would lean toward keeping the subject.genotype but I am not highly invested either way.

"better" only in that the data structure is simpler, and yes it is a minor detail about syntax. The semantics aren't in question. I would ask the reverse question, how are these two field paths:

subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class subject.genotype.receptor_genotype_set.genotype_class_list.locus

better than these two field paths?

subject.mhc_genotype_set.mhc_genotype_list.mhc_class subject.receptor_genotype_set.genotype_class_list.locus

There doesn't seem to be any specific advantage to the deeper structure with the longer field path.

In other place in the schema, we try to only create these composite objects when it creates a meaningful composite. As William indicated, these two concepts are independent enough that composing them into genotype doesn't seem to add anything.

schristley avatar Feb 08 '24 22:02 schristley

Ultimately any genotype is going to be derived from experimental data and may, to a greater or lesser extent, be partial. I think it’s important that the underlying data is explicitly referenced. This could be a reference to one or more repertoires, and/or to genomic sequencing, or even a snp array perhaps, together with some indication of the protocol and scope. I’m ok with the idea of storing all genotypes at the subject level. If a genotype has been used to annotate a repertoire, the repertoire should point back to the genotype that was used. I think we should try to avoid any implication that a particular repertoire is comprehensive or otherwise and stick to the description of how it was obtained, as standards will continue to change over time.

javh edit: remove email headers.

williamdlees avatar Feb 08 '24 22:02 williamdlees

apologies, I meant to say “any implication that a particular genotype is comprehensive…”, not “any particular repertoire”.

javh edit: remove email headers.

williamdlees avatar Feb 08 '24 22:02 williamdlees

subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class subject.genotype.receptor_genotype_set.genotype_class_list.locus

better than these two field paths?

subject.mhc_genotype_set.mhc_genotype_list.mhc_class subject.receptor_genotype_set.genotype_class_list.locus

There doesn't seem to be any specific advantage to the deeper structure with the longer field path.

I don't see the need for the change - which requires work 8-) It is not insignificant work in that it breaks backwards compatibility. AIRR v1 and AIRR v2 files with genotype will no longer be compatible. Repositories need to change, the Gateway needs to change, etc.

I also feel that logically they belong together since they are both genotype. I will bow to your wisdom in terms of whether biologically this is an important concept that is worthy of capturing in the schema. It was my understanding that subject genotype was important.

You can see the redundancy in the naming, which you don't really need in either case - e.g. you could have: subject.genotype.mhc_set.class_list.class subject.genotype.receptor_set.class_list.locus

Bottom line is I am fine however the group want to do this (I have made my case). I would suggest that you clean up some of the redundancy in the naming in either case when you make the changes.

bcorrie avatar Feb 09 '24 18:02 bcorrie

subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class subject.genotype.receptor_genotype_set.genotype_class_list.locus better than these two field paths? subject.mhc_genotype_set.mhc_genotype_list.mhc_class subject.receptor_genotype_set.genotype_class_list.locus There doesn't seem to be any specific advantage to the deeper structure with the longer field path.

I don't see the need for the change - which requires work 8-) It is not insignificant work in that it breaks backwards compatibility. AIRR v1 and AIRR v2 files with genotype will no longer be compatible. Repositories need to change, the Gateway needs to change, etc.

Yeah, it's already "out in the wild". If not I'd push to clean it up but now I'm fine with leaving it as the churn isn't likely worth it. Structurally and semantically it is fine, just a little verbose. Besides, it wouldn't be a real spec without a few quirks! ;-D

schristley avatar Feb 09 '24 19:02 schristley

So have we decided to leave this as is for AIRR 2.0?

bcorrie avatar Aug 12 '24 19:08 bcorrie

From the call:

  • Not ideal, but let's leave it as is because needless effort.

javh avatar Sep 09 '24 18:09 javh