gff3toembl

How to prepare a novel strain/isolate of a bacteria?

Open peterjc opened this issue 7 years ago • 11 comments

(Some months back I did this successfully to submit a new strain from a different genus, so while I might be doing something wrong/different, I suspect the ENA validator has become stricter in the meantime)

This is for an unnamed Serratia which does not (yet) have a unique NCBI taxonomy entry. The parent would be Serratia, taxid 613:

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=613&lvl=3&lin=f&keep=1&srchmode=1&unlock

I have tried that, and also the entry Serratia sp., taxid 616:

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=616&lvl=5&lin=f&keep=1&srchmode=1&unlock

$ gff3_to_embl --authors "Other A.N." -m "Serratia sp. XYZ annotated using Prokka." -g circular -c PROK -l XYZ -n 11 -f XYZ.embl "Serratia sp." 616 PRJEB00000 "Serratia sp. XYZ" XYZ.gff

Either taxid approach fails validation:

$ java -jar embl-api-validator-1.1.149.jar XYZ.embl
...
ERROR: Scientific_name "Serratia sp." is not submittable. (MasterEntrySourceCheck_2)  line: 1 of XYZ.embl
ERROR: At least one of the following qualifiers "strain, environmental_sample, isolate" must exist when organism belongs to Bacteria. (OrganismAndRequiredQualifierCheck)  line: 17 of XYZ.embl
...

Here line 17 was the source feature. Manually editing the EMBL file to add a strain qualifier to the feature worked for me, but what exactly it wants for species name eludes me.
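For anyone triaging similar output, here is a minimal Python sketch that pulls the check name and line number out of validator messages, so they can be matched back to the EMBL file (as with line 17 and the source feature here). It assumes the messages keep the `ERROR: ... (CheckName)  line: N of FILE` shape quoted in this issue; `parse_validator_line` is a made-up helper name, not part of the validator or of gff3toembl.

```python
import re

# Pattern assumed from the two messages quoted in this issue only;
# other validator versions may format their output differently.
MESSAGE_RE = re.compile(
    r"^(?P<severity>[A-Z]+): (?P<message>.*) \((?P<check>\w+)\)"
    r"\s+line: (?P<line>\d+) of (?P<file>.+)$"
)


def parse_validator_line(text):
    """Return a dict for one validator message, or None if it does not match."""
    match = MESSAGE_RE.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])  # line number within the EMBL file
    return fields


if __name__ == "__main__":
    sample = (
        'ERROR: Scientific_name "Serratia sp." is not submittable. '
        "(MasterEntrySourceCheck_2)  line: 1 of XYZ.embl"
    )
    parsed = parse_validator_line(sample)
    print(parsed["check"], parsed["line"], parsed["file"])
```

Lines that do not fit the pattern (such as the `...` placeholders) come back as `None`, so they can simply be skipped.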

Am I missing something simple?

[Update: Yes, I was not giving the full organism name to gff3_to_embl, but also there was a problem with this version of the validator]

Should gff3_to_embl have options for inserting source feature qualifiers "strain, environmental_sample, isolate" (or should I have done this in prokka)?

Thanks!

peterjc, Nov 24 '16 12:11

Hi Peter, A few months ago they blocked high-level taxa. They want you to use more specific taxa, apparently. For completely new species there's a chicken-and-egg problem. In the olden days every assembly got a new taxon ID (which is why there are nearly 2 million). However, NCBI (who assign taxon IDs) now demand a publication before they will grant one, so you have to use a temporary taxon, then update later. It's quite convoluted.

As for strain, we submit using their API interface, so we have to provide a header in the EMBL file, which then gets overwritten with whatever metadata is in the BioSample. It's possible they have moved the goalposts again in the week since we last submitted data...

andrewjpage, Nov 24 '16 15:11

Ah. My hunch was right, and yes - this is exactly the chicken-and-egg situation I am facing.

Could you elaborate on what you meant by using a temporary taxon?

peterjc, Nov 24 '16 16:11

See https://github.com/enasequence/sequencetools/issues/15

This error turned out to be caused by the validator's internal settings:

ERROR: Scientific_name "Serratia sp." is not submittable. (MasterEntrySourceCheck_2)

However, to avoid this error I currently need to manually edit the source feature in my EMBL file:

ERROR: At least one of the following qualifiers "strain, environmental_sample, isolate" must exist when organism belongs to Bacteria. (OrganismAndRequiredQualifierCheck)

Perhaps for people like me using the ENA webin (web interface), rather than the API, there needs to be an extra set of options on gff3_to_embl to record the strain, environmental sample or isolate fields?

[Update: Human error, see below - I was not giving the full organism name to gff3_to_embl]

peterjc, Nov 28 '16 16:11

(I've not actually submitted this new sequence yet - but I intend to try using the genus level taxid as before)

peterjc, Nov 28 '16 16:11

Hi Peter, I can't replicate your error with the latest version of the validator. The following EMBL file validates fine (without a strain, environmental_sample or isolate qualifier). Might there be another issue somewhere?

ID   XXX; XXX; circular; genomic DNA; STD; PROK; 240 BP.
XX
AC   XXX;
XX
AC * _ERS111111SCcontig000001
XX
PR   Project:PRJEB1111;
XX
DE   XXX;
XX
RN   [1]
RA   Pathogen Genomics;
RT   "Draft assembly annotated with Prokka";
RL   Submitted (24-Nov-2016) to the INSDC.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..240
FT                   /organism="Staphylococcus aureus"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:1280"
FT                   /note="ERS11111|SC|contig000001"
FT   tRNA            143..218
FT                   /product="tRNA-Val(tac)"
FT                   /inference="COORDINATES:profile:Aragorn:1.2.36"
FT                   /locus_tag="SAMEA1111111_00001"
SQ   Sequence 240 BP; 60 A; 60 C; 60 G; 60 T; 0 other;
     aatctacatt catatgtctg gtgactatag caaggaggtc acacctgttc ccatgccgaa        60
     cacagaagtt aagctcctta gcgtcgatgg tagttggact tacgttccgc tagagtagaa       120
     cgttgccagg caatgataaa tcggagaatt agctcagctg ggagagcatc tgccttacaa       180
     gcagagggtc ggcggttcga acccgtcatt ctccaccatt tattcttaca tattgccggc       240
//

andrewjpage, Nov 29 '16 15:11

If you could edit your example above on GitHub to wrap it in triple back-ticks, GitHub will render it as a code block, and preserve the white space (so I can copy and paste it for testing here).

I suspect the key difference is your example has a taxid for a full species name, Staphylococcus aureus taxon 1280.

What happens if you change the example to pretend you have a new species/strain without a pre-existing taxon id, say Staphylococcus sp. XYZ, and try either taxon 1279 (Staphylococcus) or 29387 (Staphylococcus sp.)?
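To make that experiment easy to repeat, here is a rough Python sketch that rewrites the `/organism` and `/db_xref="taxon:..."` qualifiers in an EMBL flat file so several organism/taxid combinations can be fed to the validator. `set_source_organism` is a hypothetical helper (not part of gff3toembl), and it assumes both qualifiers sit on single lines and appear only in the source feature, as in the example file in this thread.

```python
def set_source_organism(embl_text, organism, taxid):
    """Return embl_text with /organism and the taxon db_xref replaced.

    Assumption: each qualifier fits on one FT line; wrapped values
    are not handled in this sketch.
    """
    out = []
    for line in embl_text.splitlines():
        body = line[2:].lstrip() if line.startswith("FT") else ""
        if body.startswith("/organism="):
            # Keep the original "FT" prefix and indentation.
            prefix = line[: line.index("/organism=")]
            line = f'{prefix}/organism="{organism}"'
        elif body.startswith('/db_xref="taxon:'):
            prefix = line[: line.index("/db_xref=")]
            line = f'{prefix}/db_xref="taxon:{taxid}"'
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    indent = "FT" + " " * 19
    example = "\n".join([
        "FT   source          1..240",
        indent + '/organism="Staphylococcus aureus"',
        indent + '/mol_type="genomic DNA"',
        indent + '/db_xref="taxon:1280"',
    ])
    # The two taxids suggested in this comment.
    for taxid in (1279, 29387):
        print(set_source_organism(example, "Staphylococcus sp. XYZ", taxid))
```

Each variant can then be written to its own file and passed to the validator jar in the same way as XYZ.embl earlier in the issue.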

peterjc, Nov 29 '16 16:11

Here's the file (as a file): example_embl.txt

So the genus taxon 1279 (Staphylococcus) gets through the validator, but you'll get an email in a few days/weeks informing you that the 'computer says NO'.

andrewjpage, Nov 30 '16 09:11

Confirmed using embl-api-validator-1.1.150.jar. Likewise using taxon 613 and Serratia sp. XYZ passes validation:

FT   source          1..240
FT                   /organism="Serratia sp. XYZ"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:613"

This was my problematic version:

FT   source          1..5090820
FT                   /organism="Serratia sp."
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:613"

I can pass validation by adding /strain="XYZ" (as mentioned above) or, more simply, by giving the full organism name as /organism="Serratia sp. XYZ". With hindsight this seems obvious. Your example was very helpful, thank you.
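For future readers who do want the /strain route, the manual edit can be scripted. This is only a sketch under assumptions: `add_source_qualifier` is a made-up helper rather than a gff3toembl feature, it relies on qualifier lines starting with `FT` plus 19 spaces (as in the snippets in this issue), and it appends the new qualifier at the end of the source feature.

```python
QUALIFIER_INDENT = "FT" + " " * 19  # qualifier text starts at column 22


def add_source_qualifier(embl_text, key, value):
    """Append /key="value" to each source feature unless already present."""
    lines = embl_text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        out.append(line)
        i += 1
        if line.startswith("FT   source "):
            # Collect the qualifier (and continuation) lines of this feature.
            block = []
            while i < len(lines) and lines[i].startswith(QUALIFIER_INDENT):
                block.append(lines[i])
                i += 1
            present = any(
                q[len(QUALIFIER_INDENT):].startswith(f"/{key}=") for q in block
            )
            out.extend(block)
            if not present:
                out.append(f'{QUALIFIER_INDENT}/{key}="{value}"')
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    problematic = "\n".join([
        "FT   source          1..5090820",
        QUALIFIER_INDENT + '/organism="Serratia sp."',
        QUALIFIER_INDENT + '/mol_type="genomic DNA"',
        QUALIFIER_INDENT + '/db_xref="taxon:613"',
    ])
    print(add_source_qualifier(problematic, "strain", "XYZ"))
```

Running it a second time on its own output leaves the text unchanged, so it should be safe to apply to a file that was already edited by hand.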

So there were at least two problems: I was not telling gff3_to_embl the full organism name, and the version of the validator I was using was (wrongly) being too strict.

I hope to submit this week, anticipating a query back about this being a novel species without a taxon ID. I will report back later with an update for future readers of this issue. Thanks!

peterjc, Nov 30 '16 11:11

Good luck with your submission!

andrewjpage, Nov 30 '16 11:11

Update on the ENA side of interest: http://listserver.ebi.ac.uk/pipermail/ena-announce/2017-January/000165.html

peterjc, Jan 25 '17 12:01

Thanks

andrewjpage, Jan 25 '17 13:01