Chado icon indicating copy to clipboard operation
Chado copied to clipboard

Chado CV/DB table challenge

Open spficklin opened this issue 6 years ago • 14 comments

We are having a bit of trouble organizing controlled vocabularies in the CV and DB tables. The problem comes in with the "short name" (e.g. GO, SO, etc...) for the vocabulary which gets stored in the db.name column. The db table also houses the URL and URL prefix. The URL prefix is meant to be used for any dbxref (even if it's not a term) to rebuild a web link to point to the cross reference.

This works great for the Gene Ontology because all of the terms have one short name (GO), so that record get stores in the db table and the three vocabularies for GO (i.e. molecular_function, biological_process, cellular_component) all get stored in in the cv table. This works great! The db.urlprefix can be used to build all of the link-outs because GO uses the same URL for all of them.

This breaks for vocabularies such as EDAM. "EDAM is an ontology of bioinformatics types of data including identifiers, data formats, operations and topics." In this case the entire vocabulary is named 'EDAM' and it has separate vocabularies (like GO) named 'data', 'operation', 'format', etc. The terms in this ontology are not prefixed with 'EDAM' but rather the vocabulary name. So, to make link-outs work, this forced us to reverse how we store them. We put 'EDAM' in the cv table... which is not ideal.

KEGG provides more challenges. We want to store KEGG orthologs and pathway as cvterms. These terms are organized into KEGG's BRITE hierarchy. So, in a way it seems like an ontology. Because orthologs and pathways have different URLs we are forced to store the PATH and KO vocabularies in the db table. But, if we want to preserve the organization of the BRITE heirarchy we have to lump these into the same KEGG BRITE cv record. Same problem as EDAM, except that higher level terms in the BRITE hierarchy don't have corresponding pages so we can't link to them. Some of our term links will be broken.

So, I think we need to adjust how we recommend and how our scripts store vocabularies in Chado. Here's one suggestion

  1. We keep the db.urlprefix field but use that as a "default". This ensures continued backwards compatibility.
  2. Because a database can use multiple URLs for all kinds of data (not just terms) and even for different vocabularies, we create a new dburl table where URLs can be stored. We document a storage format that all Chado client applications can understand if they follow it. For example, Tripal currently uses {db} and {accession} as tokens in the URL string. This allows the URL to be constructed by filling in the tokens and those tokens can appear anywhere in the string. Whereas now, usage implies that the accession is added to the end of the value in the db.urlprefix.
  3. We create a new dbxref_dburl table that allows a dbxref record to specify a different URL to use. So, if one is not specified then the default urlprefix is used.

This approach allows us to have only one true database record, and put the CV's where they go in the cv table. It would allow us to have multiple URLs per database, and indicate a different URL construction string if needed. This would solve our KEGG problem as well because we could leave the db.urlprefix blank and add to the dburl table specific URLs for dbxref records that use them.

Any thoughts?

spficklin avatar Jan 24 '18 18:01 spficklin