
feature_cvterm vs feature_dbxref vs featureprop for feature annotations

Open bradfordcondon opened this issue 5 years ago • 3 comments

Hello,

@mpoelchau and I have been discussing how Tripal stores feature annotations loaded from GFF files. Consider a gene that has been annotated with GO terms, KEGG terms, proposed PFAM domains, and InterProScan family annotations.

My understanding of the Chado tables (which I want to emphasize is up for debate) is:

  • feature_cvterm is for annotating features with all of the cases I described above (GO, KEGG, PFAM), because in each case a decision was made, based on computational evidence, to associate the feature with that annotation. The feature_cvtermprop table exists to store evidence codes, qualifiers, etc.
  • feature_dbxref is for storing references to that record itself in another database, so it should only be used to link back to the feature itself on a different site. Gene families it is a part of, for example, wouldn't belong here.
  • featureprop: it's hard for me to distinguish when a term annotation is better suited as a featureprop. Props can have pubs for evidence, but there's no featurepropprop table for evidence codes. Also, the "value" field may not make sense when tagging with an annotation.
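To make the distinction concrete, here is a minimal sketch of the three options using an in-memory SQLite database. The tables are heavily simplified stand-ins, not the real Chado schema: actual Chado tables reference cvterm, dbxref, and pub rows by foreign key rather than storing text, and the "OtherSite:gene0001" accession and the example values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-ins for the Chado tables (assumption: text columns
# replace the cvterm_id / dbxref_id / type_id foreign keys of real Chado).
conn.executescript("""
CREATE TABLE feature_cvterm (
    feature_cvterm_id INTEGER PRIMARY KEY,
    feature_id INTEGER,
    cvterm TEXT
);
CREATE TABLE feature_cvtermprop (
    feature_cvterm_id INTEGER,
    type TEXT,
    value TEXT
);
CREATE TABLE feature_dbxref (feature_id INTEGER, dbxref TEXT);
CREATE TABLE featureprop (feature_id INTEGER, type TEXT, value TEXT);
""")

FEATURE_ID = 1  # hypothetical gene feature

# 1. Term annotation: the feature is associated with a GO term, and the
#    evidence code hangs off that association via feature_cvtermprop.
cur = conn.execute(
    "INSERT INTO feature_cvterm (feature_id, cvterm) VALUES (?, ?)",
    (FEATURE_ID, "GO:0003676"),
)
conn.execute(
    "INSERT INTO feature_cvtermprop VALUES (?, ?, ?)",
    (cur.lastrowid, "evidence_code", "IEA"),
)

# 2. Cross-reference: the same record as it appears on another site.
conn.execute(
    "INSERT INTO feature_dbxref VALUES (?, ?)",
    (FEATURE_ID, "OtherSite:gene0001"),
)

# 3. Property: a typed free-text value, with nowhere to put an evidence code.
conn.execute(
    "INSERT INTO featureprop VALUES (?, ?, ?)",
    (FEATURE_ID, "Note", "putative nucleic acid binding"),
)

# Only the feature_cvterm route lets us recover annotation plus evidence.
rows = conn.execute("""
    SELECT fc.cvterm, p.type, p.value
    FROM feature_cvterm fc
    JOIN feature_cvtermprop p USING (feature_cvterm_id)
    WHERE fc.feature_id = ?
""", (FEATURE_ID,)).fetchall()
print(rows)
```

The point of the sketch is the asymmetry: the evidence code is attached to the feature-to-term link itself, which neither the cross-reference nor the property row can express in this simplified layout.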

I'll add that this is the most definitive guidance I found in my search of the Chado wiki, in the sequence module manual:

Detailed annotations, such as associations to Gene Ontology (GO) terms or Cell Ontology terms, can be attached to features using the feature_cvterm linking table. This allows multiple ontology terms to be associated with each feature. Provenance data can be attached with the feature_cvtermprop and feature_cvterm_dbxref higher-order linking tables. It is up to the curation policy of each individual Chado database instance to decide which kinds of features will be linked using feature_cvterm. Some may link terms to gene features, others to the distinct gene products (processed RNAs and polypeptides) that are linked to the gene features. Annotations for existing features can also go into the featureprop table using the Chado feature_property ontology (defined in chado/load/etc/feature_property.obo) and the comment or description terms as appropriate. The purpose of the feature property ontology (and the related chado/load/etc/genbank_feature_property.obo file) is to capture terms that are likely to appear in GFF or GenBank sequence files. In theory there is no overlap between these ontologies and the Sequence Ontology.

As for the GFF file holding the annotations:

The GFF3 spec states: Two reserved attributes, Ontology_term and Dbxref, can be used to establish links between a GFF3 feature and a data record contained in another database. Ontology_term is reserved for associations to ontologies, such as the Gene Ontology. Dbxref is used for all other cross references. While there is no firm boundary line between these two concepts, curators tend to treat ontology associations differently and hence ontology terms have been given their own reserved attribute label.

Similarly, NCBI calls most things dbxrefs, using a much broader definition than the one I use above.

Here's the conflict. KEGG terms, for example, do not come from an ontology. But when we read the GFF file, we parse Ontology_term attributes into feature_cvterm, Dbxref attributes into feature_dbxref, and everything else into featureprop. So for these annotations to go into feature_cvterm, they would need to be listed in the GFF under Ontology_term.
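The routing described above can be sketched roughly as follows. This is not Tripal's actual loader code: the function name, the set of skipped structural attributes, and the InterPro and KEGG accessions in the example line are assumptions chosen for illustration.

```python
from urllib.parse import unquote

# Reserved GFF3 attribute -> Chado linking table, as described above.
ROUTES = {"Ontology_term": "feature_cvterm", "Dbxref": "feature_dbxref"}

# Assumption: attributes that define the feature itself are not treated
# as annotations in this sketch, so they are skipped.
STRUCTURAL = {"ID", "Name", "Parent", "Alias"}


def route_attributes(column9):
    """Group the values in a GFF3 attributes column by destination table."""
    routed = {"feature_cvterm": [], "feature_dbxref": [], "featureprop": []}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, raw = pair.partition("=")
        if tag in STRUCTURAL:
            continue
        table = ROUTES.get(tag, "featureprop")
        # GFF3 allows comma-separated values with URL-style escaping.
        for value in (unquote(v) for v in raw.split(",")):
            if table == "featureprop":
                # Props keep their tag as the property type.
                routed[table].append((tag, value))
            else:
                routed[table].append(value)
    return routed


# Hypothetical attributes column for one gene.
example = (
    "ID=gene0001;Ontology_term=GO:0003676;"
    "Dbxref=InterPro:IPR000001,KEGG:K00001;"
    "Note=putative%20nucleic%20acid%20binding"
)
result = route_attributes(example)
for table, values in result.items():
    print(table, values)
```

Under this routing, the KEGG and InterPro accessions end up in feature_dbxref simply because the file lists them under Dbxref. The destination table is decided by the GFF attribute name, not by what kind of thing the accession is.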

As Monica phrased her doubts:

With GO, I get it - a GO term refers to a formal, accessioned description of a gene function (e.g. http://amigo.geneontology.org/amigo/term/GO:0003676). A GO term does not also refer to a protein sequence - you annotate the protein sequence with the GO term. An InterPro accession is an accessioned ‘signature’ (which is a combo of HMMs, profiles, position-specific scoring matrices or regular expressions), which is annotated by curators with free-text descriptions from the literature. (And they can also be associated with a GO term). As such, I view InterPro domain accessions more as entries within a very authoritative database, rather than a controlled vocabulary. Although perhaps the domain name is enough to call it a controlled vocabulary at this point?

The consequence of these decisions is that we display featureprops, feature_cvterms, and feature_dbxrefs in different locations and in different ways to end users.

bradfordcondon, Nov 07 '18 14:11