ModelPolisher
Annotating models with the genome identifier
@Midnighter requests at https://github.com/SBRG/bigg_models/issues/368:
Many models in BiGG are currently annotated with a taxonomic identifier and a reference to the model itself, for example, as shown below.
<annotation>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:bqmodel="http://biomodels.net/model-qualifiers/" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
    <rdf:Description rdf:about="#iML1515">
      <bqbiol:hasTaxon>
        <rdf:Bag>
          <rdf:li rdf:resource="http://identifiers.org/taxonomy/511145" />
        </rdf:Bag>
      </bqbiol:hasTaxon>
      <bqmodel:is>
        <rdf:Bag>
          <rdf:li rdf:resource="http://identifiers.org/bigg.model/iML1515" />
        </rdf:Bag>
      </bqmodel:is>
    </rdf:Description>
  </rdf:RDF>
</annotation>
On the website, BiGG also provides a link to the genome sequence that was used to create the model, see, for example, http://bigg.ucsd.edu/models/iML1515.
Where possible, it would be great to also create MIRIAM-compliant annotations of the genome on the model, using the identifier from the genome database or RefSeq namespaces as defined at Identifiers.org.
Is this a task for ModelPolisher?
There is an ncbi_assembly id column in the genome table of BiGG DB; however, it appears to be empty.
Additionally, there are accession_type and accession_value columns, where accession_type is currently either ncbi_accession or ncbi_assembly. BiGG resolves the Genome link to a list of models and chromosomes; the ncbi_accession values can be resolved directly by appending them as ids to https://www.ncbi.nlm.nih.gov/nuccore/. The ncbi_assembly entries are resolved to a list of chromosomes, but I don't know exactly how this is done.
From what I've gathered from BiGG, neither these accessions nor the taxon ids appear anywhere else, so retrieving RefSeq annotations would likely require fetching the corresponding entry from GenBank.
I had another look at the data BiGG provides, and this is actually easy to do, albeit with some issues regarding MIRIAM compliance. I must have been half asleep when looking at the issue last time...
All accessions starting with NC_ or NZ_ can be converted to MIRIAM-compliant URIs. All GCF_ entries should fit the genome assembly database pattern, but there seems to be a problem with resolution: if used as the id in https://identifiers.org/insdc.gca:{$id}, it is resolved to https://www.ebi.ac.uk/ena/data/view/{$id}, where no entry is available for the id. Using the NCBI resource, however, the id can be resolved correctly, so for now we could create a non-MIRIAM annotation this way. It might be worth inquiring about that issue, since resources within one namespace having different resolution capabilities contradicts my understanding of how the resolution process works. All other accessions could be added as non-MIRIAM annotations the way it is done on the BiGG Models website, i.e. https://www.ncbi.nlm.nih.gov/nuccore/{$id}.
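The mapping described above could be sketched roughly as follows. This is only an illustration in Python, not the ModelPolisher code: the function name, the `include_any_uri` parameter (loosely modelled on the INCLUDE_ANY_URI flag), the `https://identifiers.org/refseq:` URI form, and the NCBI assembly URL pattern are all assumptions; only the nuccore URL is taken from the discussion. The example accessions are placeholders.

```python
# Assumed URL patterns; only the nuccore pattern is quoted in this thread.
REFSEQ_URI = "https://identifiers.org/refseq:{id}"                # assumption
NCBI_ASSEMBLY_URL = "https://www.ncbi.nlm.nih.gov/assembly/{id}"  # assumption
NUCCORE_URL = "https://www.ncbi.nlm.nih.gov/nuccore/{id}"


def genome_annotation_uri(accession, include_any_uri=False):
    """Map an accession_value to (uri, is_miriam), or None if it is skipped.

    NC_/NZ_ accessions become MIRIAM-style RefSeq URIs. GCF_ and all other
    accessions only get a plain NCBI URL, and only if include_any_uri is set.
    """
    accession = accession.strip()
    if accession.startswith(("NC_", "NZ_")):
        return REFSEQ_URI.format(id=accession), True
    if not include_any_uri:
        return None
    if accession.startswith("GCF_"):
        return NCBI_ASSEMBLY_URL.format(id=accession), False
    return NUCCORE_URL.format(id=accession), False


# Placeholder accessions, for illustration only.
for acc in ("NC_000913.3", "GCF_000005845.2", "AE014075.1"):
    print(acc, genome_annotation_uri(acc, include_any_uri=True))
```

With `include_any_uri=False`, only the NC_/NZ_ accessions would yield an annotation; everything else is dropped rather than added as a non-MIRIAM URI.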
~~Do we want to add just the MIRIAM compliant annotations or all of them?~~
Edit: Just realized we have an INCLUDE_ANY_URI flag we could use here. What is the appropriate qualifier for these annotations, BQB_IS_VERSION_OF?
Implemented as described above in the 2.1 branch. Leaving this open to discuss the correct qualifier.
I could not find any models in BiGG that have an annotation like this.
The data is, however, contained in BiGG, and the current implementation is in this code.
As for the qualifier: Here are descriptions of the biomodels qualifiers.
Looking at that and at the examples given in the 3.1 spec, I doubt "isVersionOf" is properly used here.
It seems to me there really is no perfectly fitting qualifier, but I would probably go with bqbiol:encodes, as the subject of the reference (the RefSeq entry) does indeed arguably encode (by proxy of a conceptual "function", i.e. algorithm) the model.
@draeger @matthiaskoenig I would say this is a judgement call you are best suited to make.
I think the qualifier isEncodedBy fits better because the model does not encode the genome but it refers to the encoding of its enzymes on that genome.
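For illustration, this is roughly what adding such a genome reference with bqbiol:isEncodedBy to the model's rdf:Description could look like. It is a minimal Python sketch using only the standard library, not the ModelPolisher implementation; the helper name `add_genome_annotation` and the RefSeq URI are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL = "http://biomodels.net/biology-qualifiers/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("bqbiol", BQBIOL)


def add_genome_annotation(description, uri, qualifier="isEncodedBy"):
    """Append <bqbiol:QUALIFIER><rdf:Bag><rdf:li rdf:resource=URI/> to description."""
    qualifier_el = ET.SubElement(description, f"{{{BQBIOL}}}{qualifier}")
    bag = ET.SubElement(qualifier_el, f"{{{RDF}}}Bag")
    ET.SubElement(bag, f"{{{RDF}}}li", {f"{{{RDF}}}resource": uri})
    return qualifier_el


# Stand-in for the rdf:Description element of the model annotation.
description = ET.Element(f"{{{RDF}}}Description", {f"{{{RDF}}}about": "#iML1515"})
# Placeholder URI; the real accession would come from the BiGG genome table.
add_genome_annotation(description, "https://identifiers.org/refseq:NC_000913.3")
print(ET.tostring(description, encoding="unicode"))
```

Passing a different `qualifier` (for example `encodes`) makes it easy to compare how the alternatives discussed here would serialize.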
This unfortunately fell through the cracks with the merge, i.e. I forgot about it.
I am updating this as a bugfix ticket for a 2.2 release.