ga4gh-schemas

RFC: Should OntologyTerms be referred to exclusively by id?

Open reece opened this issue 8 years ago • 4 comments

The current OntologyTerm definition allows a record to store "id" and "term" fields. For SO, this would look like SO:0001631 and upstream_gene_variant respectively.

In SO, "term" refers to a concept. The proper primary key for a SO term is a SO "id" (e.g., SO:0001147). Each term also has a "name" (natural_variant_site), a "definition" ("Describes the natural sequence variants due to polymorphisms..."), and other fields. Names may change (and have changed) for a given SO id; therefore, developers should use SO ids internally. Furthermore, software developers have unfortunately taken liberties in rewording SO names, causing them to drift from the authoritative term definitions.

The GA4GH OntologyTerm.term is effectively a free text field. This state is likely to lead to confusion and incomplete search results as SO names evolve (officially or otherwise).
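To illustrate the incomplete-search concern, here is a minimal Python sketch. The records and the reworded name are made-up illustration data (only SO:0001631 and upstream_gene_variant come from the example above), and the two search functions are hypothetical, not part of any GA4GH API.

```python
# Hypothetical annotation records from two tools. Both refer to the same
# SO concept, but the second tool has reworded the name.
records = [
    {"id": "SO:0001631", "term": "upstream_gene_variant"},
    {"id": "SO:0001631", "term": "upstream gene variant"},  # reworded (made up)
]


def search_by_name(records, name):
    """Exact match on the free-text name field."""
    return [r for r in records if r["term"] == name]


def search_by_id(records, term_id):
    """Exact match on the stable identifier."""
    return [r for r in records if r["id"] == term_id]


name_hits = search_by_name(records, "upstream_gene_variant")
id_hits = search_by_id(records, "SO:0001631")
print(len(name_hits), len(id_hits))
```

The name search misses the record whose name was reworded, while the id search returns both records.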

Question: What should we do to make the use of SO terms more reliable? How do we make SO use more consistent while also permitting extensions?

reece, Apr 05 '16 17:04

If we accept the premise that the use cases we ought to support are for data interchange, we ought to restrict the query endpoints to the minimum useful functionality. It makes sense to me, just as your demonstration showed, that if a client can do the mapping (without inordinate data transfer) then they should.

Say, for example, that one wants to use a different version of SO than the one provided by the results of SnpEff. This would entail writing a thin layer that maps the synonyms from the new SO version to the old version when generating SO term queries. To make this process clearer to a developer, one could add a sequenceOntologyVersion field to the VariantAnnotationSet record. We should be careful not to require an ontology mapping/resolution service just to support data interchange use cases.

The schema currently states: "Exact matching across all fields of the Sequence Ontology OntologyTerm is required." This avoids the issue of mapping between separate releases of SO. However, there is no way to know ahead of time what the sourceVersion or sourceName ought to be.

tl;dr Add a field to the VariantAnnotationSet stating which version of SO is being used and simplify the query interface to only accept identifiers. It is up to the client to resolve those.
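A rough Python sketch of what that client-side flow could look like. Everything here is an assumption for illustration: the `sequence_ontology_version` attribute stands in for the proposed field, "release-A" is a placeholder rather than a real SO release, and the request shape returned by `build_query` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VariantAnnotationSet:
    """Hypothetical stand-in for the record; only the fields used here."""
    id: str
    sequence_ontology_version: str  # the proposed field, not an existing one


# Hypothetical client-side name -> id table, keyed by SO release.
# "release-A" is a placeholder label, not a real SO version.
NAME_TO_ID = {
    "release-A": {"upstream_gene_variant": "SO:0001631"},
}


def resolve_ids(names, annotation_set):
    """Map names to SO ids using the SO version the set declares."""
    version = annotation_set.sequence_ontology_version
    table = NAME_TO_ID.get(version)
    if table is None:
        raise LookupError("no local mapping for SO version " + version)
    return [table[name] for name in names]  # KeyError if a name is unknown


def build_query(annotation_set, term_ids):
    """Build an id-only search request (hypothetical request shape)."""
    return {
        "variantAnnotationSetId": annotation_set.id,
        "effects": [{"id": term_id} for term_id in term_ids],
    }


annotation_set = VariantAnnotationSet("vas-1", "release-A")
query = build_query(
    annotation_set, resolve_ids(["upstream_gene_variant"], annotation_set)
)
print(query)
```

The server only ever sees identifiers; any name or synonym handling stays in the client's mapping table.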

david4096, Apr 05 '16 18:04

It seems that if we clearly document that the OntologyTerm id is the SO term, then there is no need to enforce the consistency or long-term stability of the term ("name") field.

Searches should be done on the stable id field. The rest is for the submitting tool to fill in as it sees fit, for example to convey intent or to record the current term name.

What should be done when no SO term (id) is available?

Re: @david4096's point on versioning. I notice the VariantAnnotationSet has an external reference to Analysis. Is that a metadata object purely about the annotations, and not about the variant calls? If so, then it should be sufficient for providing the provenance data about which tools created each VariantAnnotation and hence the TranscriptEffect records it contains. If not, I suggest some similar way to document the tools and data sources used to generate a VariantAnnotationSet.

Finally, I feel the AnalysisResult is very particular to how Ensembl is doing this. Per-transcript analysis outputs are not necessarily single integer numeric values. Maybe I don't understand the structure. Is the result a value or the label for the score? (i.e. would result have a value of 'SIFT', and score have a value of 1?)

gaberudy, Apr 05 '16 21:04

Answering the last bit of my own question, Sara made it clear here https://github.com/ga4gh/server/issues/833#issuecomment-180320608

There is an Analysis record for the VariantAnnotationSet that should record package info of the annotation tool etc.

Similarly, each AnalysisResult record potentially holds a numeric value plus a categorical value, with examples like:

"analysisResults": [
{
  "analysisId": "ID_SIFT.5.2.2",
  "score": "0.43",
  "result": "tolerated"
},
{
  "analysisId": "ID_Polyphen.2.2.2_r405",
  "score": 0.012,
  "result": "benign"
}
]
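Note that in the example as quoted, one score is written as a string and the other as a number. A consumer that wants to be tolerant of both could normalize them, as in this Python sketch. The surrounding braces are added only to make the fragment a complete JSON document, and `normalize` is a hypothetical helper, not part of the schema or server.

```python
import json

# The fragment above, wrapped in braces so that it parses as JSON.
payload = """
{
  "analysisResults": [
    {"analysisId": "ID_SIFT.5.2.2", "score": "0.43", "result": "tolerated"},
    {"analysisId": "ID_Polyphen.2.2.2_r405", "score": 0.012, "result": "benign"}
  ]
}
"""


def normalize(entry):
    """Return (analysisId, score as float or None, categorical result)."""
    raw = entry.get("score")
    score = float(raw) if raw is not None else None
    return entry["analysisId"], score, entry.get("result")


results = [normalize(e) for e in json.loads(payload)["analysisResults"]]
for analysis_id, score, label in results:
    print(analysis_id, score, label)
```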

I still think this seems very specific to Variant<->Transcript interaction predictions of a specific sort. But not sure if I have a specific recommendation. As it is, I expect the analysisResults record will be largely ignored / left empty.

gaberudy, Apr 06 '16 18:04

The BioCharacteristic object wraps around OntologyTerm lists (positive and negated):

  • https://github.com/ga4gh/ga4gh-schemas/blob/metadata-modify-biocharacteristics/src/main/proto/ga4gh/bio_metadata.proto#L140
// BioCharacteristic is a prototype wrapper object for single instances
// of phenotypes, diseases ... which may be described through one or several
// ontology terms
message BioCharacteristic {
  // A free text description of the specific disease diagnosis or phenotype
  // here, which is then characterized by zero or more OntologyTerm objects.
  // The description should be concise and should not include data points
  // better expressed through specific attributes elsewhere in the schema.
  // Example (for a single disease item):
  //   "squamous cell carcinoma, base of tongue, stage 2"
  string description = 1;

  // The ontologyTerms attribute contains a list of zero (discouraged) or more
  // OntologyTerm objects covering the characteristic (e.g. disease diagnosis,
  // phenotype) reported here.
  // Example (for a single diagnosis "squamous cell carcinoma, base of tongue"):
  //
  //     term_id: "DOID:0050865",
  //     term: "tongue squamous cell carcinoma",
  //
  //     term_id: "UBERON:0006919",
  //     term: "tongue squamous epithelium",
  //
  //     term_id: "UBERON:0010033",
  //     term: "posterior part of tongue",
  //
  repeated OntologyTerm ontology_terms = 2;

  // negatedOntologyTerms are used to describe features which are explicitly
  // not part of the BioCharacteristic.
  // Example: For a phenotype
  //
  //      description: "Bilateral ventricle anomalies (but not hypertrophy)"
  //
  //   ...  one could use the ontologyTerms
  //
  //      term_id: "HP:0001711"
  //      term: "Abnormality of the left ventricle"
  //
  //      term_id: "HP:0001707"
  //      term: "Abnormality of the right ventricle"
  //
  //   ... and add to negatedOntologyTerms
  //
  //      term_id: "HP:0001714"
  //      term: "Ventricular hypertrophy"
  //
  repeated OntologyTerm negated_ontology_terms = 3;

  // Label for the logical scope of this BioCharacteristic. Typical examples
  // could be "phenotype", "disease", "observation".
  // TODO:  This may be modified into an enumeration or expressed through
  //        an OntologyTerm.
  string label = 4;
}
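In the spirit of the id-only question in this thread, a consumer of such an object could match on term_id alone and treat the term label as informational. A minimal Python sketch follows; the dataclasses are hypothetical stand-ins for the proto messages, `asserts_term` is an illustrative helper, and no ontology hierarchy reasoning is attempted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OntologyTerm:
    term_id: str
    term: str = ""  # human-readable label; not used for matching


@dataclass
class BioCharacteristic:
    description: str = ""
    ontology_terms: list = field(default_factory=list)
    negated_ontology_terms: list = field(default_factory=list)
    label: str = ""


def asserts_term(bc, term_id):
    """True if term_id is listed as present and not listed as negated."""
    present = {t.term_id for t in bc.ontology_terms}
    negated = {t.term_id for t in bc.negated_ontology_terms}
    return term_id in present and term_id not in negated


# The phenotype example from the proto comments above.
bc = BioCharacteristic(
    description="Bilateral ventricle anomalies (but not hypertrophy)",
    ontology_terms=[
        OntologyTerm("HP:0001711", "Abnormality of the left ventricle"),
        OntologyTerm("HP:0001707", "Abnormality of the right ventricle"),
    ],
    negated_ontology_terms=[
        OntologyTerm("HP:0001714", "Ventricular hypertrophy"),
    ],
    label="phenotype",
)
print(asserts_term(bc, "HP:0001711"), asserts_term(bc, "HP:0001714"))
```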

mbaudis, Apr 11 '17 14:04