Validate annotations produced from ancestral + translate

Open corneliusroemer opened this issue 2 years ago • 4 comments

I hit a bug that took me a long time to figure out. `augur export` reported the following error:

Validating schema of 'auspice/monkeypox_global.json'...
        ERROR: 'nuc' is a required property. Trace: properties - meta - properties - genome_annotations - required
Validation of 'auspice/monkeypox_global.json' failed.

------------------------
Validation of auspice/monkeypox_global.json failed. Please check this in a local instance of `auspice`, as it is not expected to display correctly. 
------------------------

It turns out that export requires nuc annotations, and these usually come in through aa_mut.json from augur translate.

I was reading annotations from a .gff into translate, which is supported in principle. However, the current implementation cannot actually read a nuc annotation from it.

It would have very much sped up debugging if augur translate had warned me (or even errored) when it realised that it was lacking nuc annotations.

I'd propose raising an error if nuc is not written to aa_mut.json:

[Error] Could not read in `nuc` annotations. Please check the annotation in your input file. For `.gff` the line needs to look like this:
MT903344.1	Genbank	source	1	197233	.	+	.	locus_tag=nuc
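A minimal sketch of the kind of check proposed above, assuming the node-data JSON keeps its annotations under a top-level "annotations" key. The function names `missing_nuc_annotation` and `check_node_data_file` are hypothetical and are not part of augur:

```python
import json


def missing_nuc_annotation(node_data: dict) -> bool:
    """Return True if the node-data dict has no annotations.nuc entry.

    Assumes annotations live under a top-level "annotations" key.
    """
    annotations = node_data.get("annotations") or {}
    return "nuc" not in annotations


def check_node_data_file(path: str) -> None:
    """Hypothetical pre-flight check for a file such as aa_mut.json."""
    with open(path) as fh:
        node_data = json.load(fh)
    if missing_nuc_annotation(node_data):
        raise ValueError(
            "Could not read in `nuc` annotations. "
            "Please check the annotation in your input file."
        )
```

Running `check_node_data_file("aa_mut.json")` right after translate would surface the problem before export does.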

Related to #881

corneliusroemer • May 25 '22 14:05

I think this issue arose as part of this Slack conversation. @corneliusroemer, am I correct in this?

huddlej • Jun 08 '22 19:06

(1 year later...)

The annotations schema now requires 'nuc' to be present (d6246ca052478446f7179e230e842a34f93e4cd4); however, neither augur ancestral nor augur translate validates its output. Reading any node-data file (via NodeDataReader) with an "annotations" block will also validate it against the schema, although in this case the failure is still going to be first encountered in augur export v2.

Conceptually we could have the annotations from ancestral define 'nuc' and translate define the CDSs, and have them merged in augur export; however I think it's sensible to require translate to add a 'nuc' block, which is why I made it a required property. If augur export sees multiple annotations.nuc entries it should really ensure they are the same length! (The JSON merging happens within NodeDataReader.)
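The length check suggested above could look roughly like this. It is a sketch only, assuming each 'nuc' entry carries 1-based inclusive "start" and "end" keys; `check_nuc_lengths` is a hypothetical helper, not something NodeDataReader provides:

```python
def nuc_length(nuc: dict) -> int:
    """Length of a nuc annotation, assuming 1-based inclusive coordinates."""
    return nuc["end"] - nuc["start"] + 1


def check_nuc_lengths(annotation_blocks: list) -> "int | None":
    """Return the shared nuc length across annotation blocks.

    Raises ValueError if the blocks disagree; returns None if no block
    defines 'nuc' at all.
    """
    lengths = {nuc_length(block["nuc"]) for block in annotation_blocks if "nuc" in block}
    if len(lengths) > 1:
        raise ValueError(f"Conflicting 'nuc' lengths across node-data files: {sorted(lengths)}")
    return lengths.pop() if lengths else None
```

Each element of `annotation_blocks` stands for the "annotations" block of one node-data file, e.g. one from ancestral and one from translate.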

jameshadfield • Aug 30 '23 22:08

Just a note, I ran into this issue working on my PRRSV dataset (https://github.com/mazeller/NextClade_Datasets/tree/main/prrsv_yimim_v3). I needed to append the following line to my GFF manually.

DQ478308.1	Genbank	source	1	603	.	+	.	locus_tag=nuc

mazeller • Jan 19 '24 22:01

> however I think it's sensible to require translate to add a 'nuc' block, which is why I made it a required property

As of 1d17699e960d3805a0a586d7ccf3e9a550d53ac9 (in master, but not yet released), augur translate will always export this. (I missed this issue when scanning; it's very similar to #953.)

> Just a note, I ran into this issue working on my PRRSV dataset (https://github.com/mazeller/NextClade_Datasets/tree/main/prrsv_yimim_v3). I needed to append the following line to my GFF manually.

P.S. Recent augur PRs (merged but not released) will fix this: we'll now read the nuc coords from the sequence-region pragma in your GFF ("##sequence-region DQ478308.1 1 603").
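For illustration, reading coordinates from that pragma might look something like the following. This is a sketch under assumptions, not augur's actual implementation; `nuc_from_sequence_region` and the returned dict shape are hypothetical:

```python
def nuc_from_sequence_region(gff_lines):
    """Return nuc coordinates from the first ##sequence-region pragma, or None.

    The pragma has the form: ##sequence-region <seqid> <start> <end>
    """
    for line in gff_lines:
        if line.startswith("##sequence-region"):
            fields = line.split()
            if len(fields) == 4:
                _, seqid, start, end = fields
                return {"seqid": seqid, "start": int(start), "end": int(end)}
    return None
```

With the pragma quoted above, this yields start 1 and end 603 for DQ478308.1, the same coordinates as the manually appended `source` line.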

jameshadfield • Jan 21 '24 20:01