augur
augur copied to clipboard
BUG: `augur translate` not faithful to `gff3` standard, can't annotate nucs, which are required.
Related to #881
Currently, augur translate
does not implement gff3
standard properly. Confusingly, if you use .gff
files generated automatically by Genbank from a given .gb
file, augur translate will behave differently.
Importantly, nuc
s won't be annotated if you use a genemap from gff
rather than gb
which is clearly a bug.
The solution is to:
a) make augur translate accept standard gff3 files, outputting nuc annotations etc
b) make augur translate behave identical whether it reads in a .gb
or .gff
annotation (that are identical in information).
I had opened #950 for a stopgap fix for the "no nuc annotation with .gff" bug, but the actual solution is to read .gff
properly and refactor augur translate
to be standards conforming.
Below are comments made on the issue and PR respectively.
I got extremely confused by this bug. I encountered it as the following error in a workflow that uses augur translate and just has a .gff as input, not a .gb with nuc annotation.
The error is:
Validating schema of 'auspice/monkeypox_global.json'... ERROR: 'nuc' is a required property. Trace: properties - meta - properties - genome_annotations - required Validation of 'auspice/monkeypox_global.json' failed. ------------------------ Validation of auspice/monkeypox_global.json failed. Please check this in a local instance of `auspice`, as it is not expected to display correctly. ------------------------
Thrown by augur export. I couldn't figure out how to get it to just read nuc from the .gff, which I assumed was possible. Couldn't believe it isn't. But that's the way it seems to be 😬
This is a very serious bug, it really makes using the same genemap for Nextclade and Nextclade reference build of monkeypox impossible which is not good, discrepancies arise like this.
Currently, augur translate cannot output nuc annotation, if .gff is used as input. This is due to the features being loaded being overly restrictive to genes.
Allowing source to be read in as features allows the .gff to contain nuc annotations in the following format:
MT903344.1 Genbank source 1 197233 . + . locus_tag=nuc That way augur export doesn't crash if a .gff` is used as annotation input for translate.
Related issue(s) Related to https://github.com/nextstrain/augur/issues/881
This fix allows me to use the same genemap.gff for both Nextclade and the workflow that creates the Nextclade ref tree. This is very good for consistency and reduced maintenance effort when Monkeypox genemap is changing a lot.
Testing I don't know if this touches any other functionality. It's a util function so there's a risk. But then at the same time, it shouldn't be wrong to output a genemap with nuc annotated.
CI passes so looks good. One may want to systematically check every function where this gff load utility is used. But first would be good to see if this PR is usable or it has some systematic flaw.
From @huddlej:
We discussed during triage meeting today that it would be good to know which parts of Augur call this function [read gff], to get a sense of the side-effects this change might have. Along those same lines, it would be awesome to see a functional test for the use case you describe here to confirm that this simple change has the effect you want and that future changes to this code won't break that functionality.
From @emmahodcroft
It's been sooooo long since I looked at this, and I would have to go dig to go find old TB files (and I'm travelling without external HDD at the moment), but I'd be a little cautious here, as it flickers some vague memory inside of me from when I used to TB stuff.
Unfortunately I really don't remember much of this so it would take some digging for me to remember.... so mostly just here to wave a caution flag to ensure that this works how we expect across VCF and various GFF files for, ex, bacteria.
it would be good to know which parts of Augur call this function
augur translate
parses GFF and GenBank files using utils.load_features, which relies on BCBio.GFF and Bio.SeqIO, respectively. No other code in augur uses load_features
. Some comments in that code indicate it was implemented just for TB.
I'd propose making two functions io.read_gff
and io.read_genbank
and adding a whole bunch of tests!
just here to wave a caution flag to ensure that this works how we expect across VCF and various GFF files for, ex, bacteria
I had this thought too! We have 2 TB tests (tests/builds/tb
and tests/builds/tb_drm
) however they're (a) not run by CI and (b) the outputs not diff'd, so we'd only know about a breaking change if it caused a command to fail. I just ran the snakefiles in those directories and they still work, so that's something!
Parsing GFF
BioPython doesn't have a parser but recommends gffutils. gffutils has a small set of dependencies but uses a database layer which seemed a bit overkill (although doesn't seem slow in my testing):
# using the TB GFF which is in this repo & the gffutils incantation in their tutorial
db = gffutils.create_db("Mtb_H37Rv_NCBI_Annot.gff", dbfn="test.db", force=True, keep_order=True, sort_attribute_values=True, merge_strategy='merge')
# are all features read? GFF file has 8407 non-comment lines
len(list(db.all_features())) # 8407
region = [feat for feat in db.all_features() if feat.featuretype=='region'][0]
region.start # 1
region.stop # 4411532
region.strand # '+'
# The same, but for MPX genome MT903344 which Cornelius references above
db = gffutils.create_db("MT903344.gff3", dbfn="test.db", force=True, keep_order=True, sort_attribute_values=True, merge_strategy='merge')
len(list(db.all_features())) # 381, which matches my expectation
region = [feat for feat in db.all_features() if feat.featuretype=='region'][0]
region.start #1
region.end # 197233
region.strand # '+'
name vs locus_tag vs gene
Whichever library we use we're going to have to make some decisions (in code or via arguments) about what name to use for features -- id vs name vs locus_tag -- see https://github.com/nextstrain/augurlinos/issues/4 for some discussion about this... 5 years ago! Looking at genes (featuretype=='gene'
), which is what utils.load_features
does, we have the following situations from the two above GFFs (using an example gene from each):
GFF | .id |
.attributes['Name'] |
.attributes['locus_tag'] |
.attributes['gene'] |
---|---|---|---|---|
TB | gene10 | ppiA_Rv0009 | Rv0009 | ppiA |
MPX | gene-MPXV-UK_P2-013 | MPXV-UK_P2-013 | MPXV-UK_P2-013 | KeyError |
Differences in how we treat GFF vs GenBank
- For GFF we use 'gene' types ("featuretype") not CDS. For Genbank we do the opposite.
- For the name of each feature, for GFFs we try to use
gene
(annotation name, not feature type) and fallback tolocus_tag
if that doesn't exist. For Genbank we do the opposite.
I just ran into this issue, too, trying to use the Nextclade gene map GFF as an input to augur translate. After thinking about this more and reading the comments above, I had a couple of additional thoughts:
- +1 for James’s suggestions to create separate read functions for GFF and GenBank files and lots of functional tests for both!
- +1 for replacing BCBio.GFF with gffutils, since the latter is maintained (and recommended by BCBio’s author).
- We do run the TB and TB_DRM tests in CI through the test runner script.
- ~None of our example or production GFF files include “a comment that identifies the file format and version”, as required by GFF3 standard. I suspect none of our files will pass a GFF validator, for this simple (if silly) reason.~ (Edit: This is not true; the test GFFs in Augur and most of the Nextclade GFFs define versions and would validate. See subsequent comments below.)
- ~Few of~ our example or production GFF files (should) define the
seqname
field. Thestart
andend
coordinates we provide in a GFF only make sense in relation to a specific sequence which we need to define in this field. See the monkeypox gene map as an example of where this is done properly. (Edit: Most Nextclade GFFs do include sequence id!) - Instead of using the
source
value of thetype
column plusgene=nuc
approach that we use in the monkeypox gene map, we should use thesequence-region
pragma to define the complete start and end coordinates for each sequence id referenced in the file (this line of the monkeypox gene map). We'd need to confirm that gffutils parses this as expected, but this appears to be the standard way to represent this information. - Building on the discussion 5 years ago about the lack of standard naming conventions for the qualifier keys, the GFF3 specification does define expected key names for genes which include
ID
andName
. Since INSDC databases require alocus_tag
qualifier for all registered genomes, we should still accept (and perhaps prefer) this qualifier even whenID
andName
are provided. We should also consider allowing the user to specific the qualifier used for gene names in the GFF file they pass to any augur command that calls the forthcomingio.read_gff
function, for backwards compatibility with our own data like the Nextclade gene maps that usegene_name
as the qualifier key.
As an example of how we might properly format our GFF3 files for parsing in Augur (and elsewhere), here is the gene map for SARS-CoV-2 from nextclade dataset get --name sars-cov-2
:
# Gene map (genome annotation) of SARS-CoV-2 in GFF format.
# For gene map purpses we only need some of the columns. We substitute unused values with "." as per GFF spec.
# See GFF format reference at https://www.ensembl.org/info/website/upload/gff.html
# seqname source feature start end score strand frame attribute
. . gene 26245 26472 . + . gene_name=E
. . gene 26523 27191 . + . gene_name=M
. . gene 28274 29533 . + . gene_name=N
. . gene 266 13468 . + . gene_name=ORF1a
. . gene 13468 21555 . + . gene_name=ORF1b
. . gene 25393 26220 . + . gene_name=ORF3a
. . gene 27202 27387 . + . gene_name=ORF6
. . gene 27394 27759 . + . gene_name=ORF7a
. . gene 27756 27887 . + . gene_name=ORF7b
. . gene 27894 28259 . + . gene_name=ORF8
. . gene 28284 28577 . + . gene_name=ORF9b
. . gene 21563 25384 . + . gene_name=S
And here is what I would propose it should look like instead:
##gff-version 3.1.26
##sequence-region MN908947 1 29903
# seqname source feature start end score strand frame attribute
MN908947 GenBank gene 26245 26472 . + . ID=E
MN908947 GenBank gene 26523 27191 . + . ID=M
MN908947 GenBank gene 28274 29533 . + . ID=N
MN908947 GenBank gene 266 13468 . + . ID=ORF1a
MN908947 GenBank gene 13468 21555 . + . ID=ORF1b
MN908947 GenBank gene 25393 26220 . + . ID=ORF3a
MN908947 GenBank gene 27202 27387 . + . ID=ORF6
MN908947 GenBank gene 27394 27759 . + . ID=ORF7a
MN908947 GenBank gene 27756 27887 . + . ID=ORF7b
MN908947 GenBank gene 27894 28259 . + . ID=ORF8
MN908947 GenBank gene 28284 28577 . + . ID=ORF9b
MN908947 GenBank gene 21563 25384 . + . ID=S
My general feeling now is that we might:
- spend some time with the GFF3 specification, making sure we actually understand what we want our files to look like
- write tests for what files we should be able to parse
- implement
io.read_gff
to follow the expected GFF specification but with backwards compatibility for existing formats - implement proper GFF3 formats for all of our own files
I forgot to include context of what I was doing to run into this same issue: I've been developing an updated Nextstrain introductory tutorial to expand the current "Zika tutorial" and use SARS-CoV-2 instead. I was hoping to use nextclade dataset get
as a way to provision the reference data and gene map users would need to run all of the subsequent augur commands without needing a separate GenBank file. The nextclade dataset
interface seems generally like a great way to expose users to curated references for different pathogens. So, my hope is that the changes we're talking about in this issue would eventually allow us to use these data in an updated tutorial/workshop setting.
The immediate blocking issue in the new tutorial case is that augur translate
doesn't know to look for a gene_name
qualifier, but it's clear from the conversation above that this is a smaller part of a bigger rewrite that needs to happen.
Looking at other Nextclade datasets today, I noticed that most others follow the general GFF3 pattern we'd expect. For example, the H3N2 HA gene map looks like this:
##gff-version 3
##sequence-region CY163680.1 1 1737
CY163680.1 feature gene 18 65 . + . gene_name=SigPep
CY163680.1 feature gene 66 1052 . + . gene_name=HA1
CY163680.1 feature gene 1053 1715 . + . gene_name=HA2
This file validates as expected with the GFF3 validator. Also, this file works well with gffutils. For example, I can run the following code to extract the HA1 sequence from the corresponding reference FASTA file:
import gffutils
# Try to use any valid qualifier key as an id in the given order.
db = gffutils.create_db("genemap.gff", ":memory:", id_spec=["locus_tag", "gene", "gene_name"]
# Get coordinates for HA1.
db["HA1"]
# Get sequence for HA1 from a FASTA file.
db["HA1"].sequence("reference.fasta")
I like that gffutils lets us specify a list of valid qualifier keys to use as the gene id. This feature would eliminate the need to manually check for each separate key we support in utils.load_features
.
On the other hand, I don't see any functionality in gffutils that parses the sequence-region
pragma in a way that would properly give us the full extent of a sequence for the nuc
annotation. Interestingly, the BCBio parser does parse this pragma. For example, here is how it parses the TB GFF in augur's tests:
>>> from BCBio import GFF
>>> record = list(GFF.parse("tests//builds/tb/data/Mtb_H37Rv_NCBI_Annot.gff"))[0]
>>> record.annotations
{'gff-version': ['3'],
'sequence-region': [('MTB_anc', 0, 4411532)],
'species': ['https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=83332']}
I haven't found any examples of GFF files where people set the type
column value to source
as a way of defining the whole range of coordinates for the sequence (as we do in #976). Instead, the GFF spec suggests that the region
value might be the closest to a standard for this functionality. (Edit: I do see now that this is the standard name for this coordinate range in GenBank files like the Zika reference we use.)
+1 for "spend[ing] some time with the GFF3 specification"
gffutils doesn't parse any directives/pragmas, but it does extract them. Parsing is a simple matter after that:
>>> { fields[0]: tuple(fields[1:]) for fields in (line.split() for line in db.directives) }
{'gff-version': ('3',), 'sequence-region': ('CY163680.1', '1', '1737')}