AnchorWave
AnchorWave copied to clipboard
Empty CDS file from gff2seq
Hello,
I am getting an empty CDS file from running gff2seq with no error message.
My command is anchorwave gff2seq -i GCF_003119195.2_ASM311919v2_genomic.gff -r GCF_003119195.2_ASM311919v2_genomic.fna -o cds.fa
The genomic data is directly from NCBI https://www.ncbi.nlm.nih.gov/genome/95891?genome_assembly_id=1473191
I am running it on Ubuntu 20.04.4 in a conda environment after only installing anchorwave.
Hi, thanks for giving AnchorWave a try. The format of GCF_003119195.2_ASM311919v2_genomic.gff is different from what AnchorWave expects. Every gene has the same ID "gene-". I am not sure I could parse this GFF file even manually.
Do you have the GFF file for another genome, please?
@baoxingsong Hi Baoxing, some of us in the Buckler Lab recently encountered this and it caused quite a bit of confusion since AnchorWave provides no error or message to explain why the CDS file is empty.
Would it be possible to add a message (e.g. "Please check your GFF for genes with a unique ID") when gff2seq
writes an empty CDS file?
I've received correspondence related to this open issue, and want to post further details on how I solved my situation in case it helps others.
After comparing with a working GFF, I discovered that the non-working GFF lacked a gene
type (there were only CDS
and mRNA
), whereas the working GFF associated each mRNA
to a gene
with the same start/stop/strandedness etc.
For others, inspect your non-working GFF and compare how it associates CDS
, mRNA
, and gene
together (paying particular attention to the attributes detailed in column 9):
##gff-version 3
chr7 CLEAN gene 25420269 25421320 . - . ID=Pgl_GLEAN_10006696;Name=Pgl_GLEAN_10006696;
chr7 CLEAN mRNA 25420269 25421320 . - . ID=Pgl_GLEAN_10006696.mrna;Parent=Pgl_GLEAN_10006696;Name=Pgl_GLEAN_10006696.mrna;
chr7 GLEAN CDS 25421320 25421713 . - 0 ID=cds1;Parent=Pgl_GLEAN_10006696.mrna;
chr7 GLEAN CDS 25420562 25420635 . - 2 ID=cds2;Parent=Pgl_GLEAN_10006696.mrna;
chr7 GLEAN CDS 25420153 25420269 . - 0 ID=cds3;Parent=Pgl_GLEAN_10006696.mrna;
Props to @agostof for help with this
Hi,
I've encountered the above issue with several different assemblies from independent groups downloaded directly from NCBI, for example one here. There is nothing obviously wrong with the GFF files, and in this case it doesn't lack a gene
category as suggested above.
Here's the command I'm using
anchorwave gff2seq -r GCA_029852735.1_ASM2985273v1_genomic.fna -i genomic.gff -o test
There is no error message, giving the impression the command worked, but the resulting file is empty. It's unclear to me what about the GFF format of this file Anchorwave doesn't like.
Any suggestions?
@jgroh My guess is that AnchorWave's regex is stumbling on the pipes |
in your GFF's ID
attribute. I ran the command you provided on the GFF from NCBI you linked after modifying it:
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM2985273v1
#!genome-build-accession NCBI_Assembly:GCA_029852735.1
##sequence-region CM056809.1 1 97805473
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3435
CM056809.1 Genbank region 1 97805473 . + . ID=CM056809.1:1..97805473;Dbxref=taxon:3435;Name=1;chromosome=1;collected-by=Onkar Nath;country=Australia: Queensland;cultivar=Hass;dev-stage=Mature Tree;gbkey=Src;genome=chromosome;isolate=MH2022;mol_type=genomic DNA;tissue-type=Leaves
CM056809.1 Genbank gene 55876 68562 . - . ID=gene-MRB53_000001;Name=MRB53_000001;gbkey=Gene;gene_biotype=protein_coding;locus_tag=MRB53_000001
CM056809.1 Genbank mRNA 55876 68562 . - . ID=rna-gnl;Parent=gene-MRB53_000001;gbkey=mRNA;locus_tag=MRB53_000001;orig_protein_id=gnl|WGS:JAQQXO|cds-35697;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein
CM056809.1 Genbank exon 68481 68562 . - . ID=exon-gnl|WGS:JAQQXO|g1.t1-1;Parent=rna-gnl|WGS:JAQQXO|g1.t1;gbkey=mRNA;locus_tag=MRB53_000001;orig_protein_id=gnl|WGS:JAQQXO|cds-35697;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein
CM056809.1 Genbank exon 67843 67935 . - . ID=exon-gnl|WGS:JAQQXO|g1.t1-2;Parent=rna-gnl|WGS:JAQQXO|g1.t1;gbkey=mRNA;locus_tag=MRB53_000001;orig_protein_id=gnl|WGS:JAQQXO|cds-35697;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein
CM056809.1 Genbank exon 67005 67292 . - . ID=exon-gnl|WGS:JAQQXO|g1.t1-3;Parent=rna-gnl|WGS:JAQQXO|g1.t1;gbkey=mRNA;locus_tag=MRB53_000001;orig_protein_id=gnl|WGS:JAQQXO|cds-35697;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein
CM056809.1 Genbank CDS 68481 68562 . - 0 ID=cds-KAJ8646978.1;Parent=rna-gnl|WGS:JAQQXO|g1.t1;Dbxref=NCBI_GP:KAJ8646978.1;Name=KAJ8646978.1;gbkey=CDS;locus_tag=MRB53_000001;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein;protein_id=KAJ8646978.1
CM056809.1 Genbank CDS 67843 67935 . - 2 ID=cds-KAJ8646978.1;Parent=rna-gnl|WGS:JAQQXO|g1.t1;Dbxref=NCBI_GP:KAJ8646978.1;Name=KAJ8646978.1;gbkey=CDS;locus_tag=MRB53_000001;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein;protein_id=KAJ8646978.1
CM056809.1 Genbank CDS 67005 67292 . - 2 ID=cds-KAJ8646978.1;Parent=rna-gnl|WGS:JAQQXO|g1.t1;Dbxref=NCBI_GP:KAJ8646978.1;Name=KAJ8646978.1;gbkey=CDS;locus_tag=MRB53_000001;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein;protein_id=KAJ8646978.1
The mRNA
value was changed from ID=rna-gnl|WGS:JAQQXO|g1.t1;
to ID=rna-gnl;
.
And this is the result I get with AnchorWave v1.2.3:
>rna-gnl gene-MRB53_000001
ATGGAAGGACCCCACCGCACCATCTATCATGAGTGGGCCACATTTGGTGAAGGGCCCACTACTAGTAACCAGGACGTTACCAAAGAAGAAGAATCTGACTTGGGGTGTGTCCGCTTCCGCCTTTTGAATTGGGCAAATCACAGCAGGCAGCAAGCAATGTTTGGTTCCACGAATTCTTTTGGGCAGTCTTCAAGTAGCCCATTTGGAACCCAATCAGTTTTTGGGCAGACAAACAATGCTACCACCAATCCTTTTGCTCCCAAACCCTTTGGTAGTACAAGCCCTTTTGGTTCCCAGACTGGCAGTTCTGTATTTGGTGGCACATCAACTGGTGTATTTGGAACACCACAATCTTCTACACCAGCGTTTGGTGCTTCGACGGCACCCGCTTTTGGAAGCTCGGTGCCTGGTTTTGGGGCGCCATCAACTCCGTCTTTTGGTAGTGCTTCGTCCTCTTTTGGTG
You might have to manually edit your GFF unless @baoxingsong updates the regular expression for you.
I took a second look at this, and it's actually not just the pipes but the name itself. For example, none of the following ID
values for the mRNA
line work:
rna-gnlt1
rna-gnl.t1
rna-gnl:g1
Only rna-gnl
results in a successful execution given my tiny subset of your GFF. Baoxing will have to chime in. I am not familiar enough with the GFF spec to understand what AnchorWave is doing here.
OK yeah I do think it's a regex issue with the pipes |
and other ignored characters.
My hunch is that the CDS
parsing of Parent=rna-gnl|WGS:JAQQXO|g1.t1;
breaks on the pipes etc, such that the parsed value is rna-gnl
and the rest is lost. That fits with Baoxing's regex from here - online regex demo.
Edit: So to clarify, the previous three values from this comment actually do work. You just have to update the CDS
entries to match. The regex finds them just fine, so long as there isn't a pipe.
Hello, I have a question for you. When I run the code anchorwave gff2seq -i SL5.0.gff3-r CDs.fa > Anchorwave.log 2>&1
, there is no error but the generated cds file is empty. And my gff3 file and fasta file have the same chromosome names from Chr01 to Chr12 respectively
Thanks @matthewwiese for digging into this, finally getting back around to it. For the specific gff I reference above, the issue was fixed by editing the gff to replace the pipe character:
sed 's/|WGS:JAQQXO|/_WGS:JAQQXO_/g' {input} > {output}
@baoxingsong I agree it would be helpful to include an error message for cases like this along the lines of "mRNA and CDS records in GFF contain unrecognized characters" or that links to the proper regex