AnchorWave Empty CDS file from gff2seq

Hello,

I am getting an empty CDS file from running gff2seq with no error message. My command is anchorwave gff2seq -i GCF_003119195.2_ASM311919v2_genomic.gff -r GCF_003119195.2_ASM311919v2_genomic.fna -o cds.fa The genomic data is directly from NCBI https://www.ncbi.nlm.nih.gov/genome/95891?genome_assembly_id=1473191

I am running it on Ubuntu 20.04.4 in a conda environment after only installing anchorwave.

Sep 02 '22 13:09 nwespe

Hi, thanks for giving AnchorWave a try. The format of GCF_003119195.2_ASM311919v2_genomic.gff is different from what AnchorWave expects. Every gene has the same ID "gene-". I am not sure I could parse this GFF file even manually.

Do you have the GFF file for another genome, please?

Sep 07 '22 01:09 baoxingsong

@baoxingsong Hi Baoxing, some of us in the Buckler Lab recently encountered this and it caused quite a bit of confusion since AnchorWave provides no error or message to explain why the CDS file is empty.

Would it be possible to add a message (e.g. "Please check your GFF for genes with a unique ID") when gff2seq writes an empty CDS file?

Sep 28 '23 17:09 matthewwiese

I've received correspondence related to this open issue, and want to post further details on how I solved my situation in case it helps others.

After comparing with a working GFF, I discovered that the non-working GFF lacked a gene type (there were only CDS and mRNA), whereas the working GFF associated each mRNA to a gene with the same start/stop/strandedness etc.

For others, inspect your non-working GFF and compare how it associates CDS, mRNA, and gene together (paying particular attention to the attributes detailed in column 9):

##gff-version   3
chr7    CLEAN   gene    25420269        25421320        .       -       .       ID=Pgl_GLEAN_10006696;Name=Pgl_GLEAN_10006696;
chr7    CLEAN   mRNA    25420269        25421320        .       -       .       ID=Pgl_GLEAN_10006696.mrna;Parent=Pgl_GLEAN_10006696;Name=Pgl_GLEAN_10006696.mrna;
chr7    GLEAN   CDS     25421320        25421713        .       -       0       ID=cds1;Parent=Pgl_GLEAN_10006696.mrna;
chr7    GLEAN   CDS     25420562        25420635        .       -       2       ID=cds2;Parent=Pgl_GLEAN_10006696.mrna;
chr7    GLEAN   CDS     25420153        25420269        .       -       0       ID=cds3;Parent=Pgl_GLEAN_10006696.mrna;

Props to @agostof for help with this

Oct 30 '23 14:10 matthewwiese

Hi,

I've encountered the above issue with several different assemblies from independent groups downloaded directly from NCBI, for example one here. There is nothing obviously wrong with the GFF files, and in this case it doesn't lack a gene category as suggested above.

Here's the command I'm using

anchorwave gff2seq -r GCA_029852735.1_ASM2985273v1_genomic.fna -i genomic.gff -o test

There is no error message, giving the impression the command worked, but the resulting file is empty. It's unclear to me what about the GFF format of this file Anchorwave doesn't like.

Any suggestions?

Mar 26 '24 23:03 jgroh

@jgroh My guess is that AnchorWave's regex is stumbling on the pipes | in your GFF's ID attribute. I ran the command you provided on the GFF from NCBI you linked after modifying it:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM2985273v1
#!genome-build-accession NCBI_Assembly:GCA_029852735.1
##sequence-region CM056809.1 1 97805473
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3435
CM056809.1	Genbank	region	1	97805473	.	+	.	ID=CM056809.1:1..97805473;Dbxref=taxon:3435;Name=1;chromosome=1;collected-by=Onkar Nath;country=Australia: Queensland;cultivar=Hass;dev-stage=Mature Tree;gbkey=Src;genome=chromosome;isolate=MH2022;mol_type=genomic DNA;tissue-type=Leaves
CM056809.1	Genbank	gene	55876	68562	.	-	.	ID=gene-MRB53_000001;Name=MRB53_000001;gbkey=Gene;gene_biotype=protein_coding;locus_tag=MRB53_000001
CM056809.1	Genbank	mRNA	55876	68562	.	-	.	ID=rna-gnl;Parent=gene-MRB53_000001;gbkey=mRNA;locus_tag=MRB53_000001;orig_protein_id=gnl|WGS:JAQQXO|cds-35697;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein
CM056809.1	Genbank	exon	68481	68562	.	-	.	ID=exon-gnl|WGS:JAQQXO|g1.t1-1;Parent=rna-gnl|WGS:JAQQXO|g1.t1;gbkey=mRNA;locus_tag=MRB53_000001;orig_protein_id=gnl|WGS:JAQQXO|cds-35697;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein
CM056809.1	Genbank	exon	67843	67935	.	-	.	ID=exon-gnl|WGS:JAQQXO|g1.t1-2;Parent=rna-gnl|WGS:JAQQXO|g1.t1;gbkey=mRNA;locus_tag=MRB53_000001;orig_protein_id=gnl|WGS:JAQQXO|cds-35697;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein
CM056809.1	Genbank	exon	67005	67292	.	-	.	ID=exon-gnl|WGS:JAQQXO|g1.t1-3;Parent=rna-gnl|WGS:JAQQXO|g1.t1;gbkey=mRNA;locus_tag=MRB53_000001;orig_protein_id=gnl|WGS:JAQQXO|cds-35697;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein
CM056809.1	Genbank	CDS	68481	68562	.	-	0	ID=cds-KAJ8646978.1;Parent=rna-gnl|WGS:JAQQXO|g1.t1;Dbxref=NCBI_GP:KAJ8646978.1;Name=KAJ8646978.1;gbkey=CDS;locus_tag=MRB53_000001;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein;protein_id=KAJ8646978.1
CM056809.1	Genbank	CDS	67843	67935	.	-	2	ID=cds-KAJ8646978.1;Parent=rna-gnl|WGS:JAQQXO|g1.t1;Dbxref=NCBI_GP:KAJ8646978.1;Name=KAJ8646978.1;gbkey=CDS;locus_tag=MRB53_000001;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein;protein_id=KAJ8646978.1
CM056809.1	Genbank	CDS	67005	67292	.	-	2	ID=cds-KAJ8646978.1;Parent=rna-gnl|WGS:JAQQXO|g1.t1;Dbxref=NCBI_GP:KAJ8646978.1;Name=KAJ8646978.1;gbkey=CDS;locus_tag=MRB53_000001;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein;protein_id=KAJ8646978.1

The mRNA value was changed from ID=rna-gnl|WGS:JAQQXO|g1.t1; to ID=rna-gnl;.

And this is the result I get with AnchorWave v1.2.3:

>rna-gnl gene-MRB53_000001
ATGGAAGGACCCCACCGCACCATCTATCATGAGTGGGCCACATTTGGTGAAGGGCCCACTACTAGTAACCAGGACGTTACCAAAGAAGAAGAATCTGACTTGGGGTGTGTCCGCTTCCGCCTTTTGAATTGGGCAAATCACAGCAGGCAGCAAGCAATGTTTGGTTCCACGAATTCTTTTGGGCAGTCTTCAAGTAGCCCATTTGGAACCCAATCAGTTTTTGGGCAGACAAACAATGCTACCACCAATCCTTTTGCTCCCAAACCCTTTGGTAGTACAAGCCCTTTTGGTTCCCAGACTGGCAGTTCTGTATTTGGTGGCACATCAACTGGTGTATTTGGAACACCACAATCTTCTACACCAGCGTTTGGTGCTTCGACGGCACCCGCTTTTGGAAGCTCGGTGCCTGGTTTTGGGGCGCCATCAACTCCGTCTTTTGGTAGTGCTTCGTCCTCTTTTGGTG

You might have to manually edit your GFF unless @baoxingsong updates the regular expression for you.

Mar 29 '24 15:03 matthewwiese

I took a second look at this, and it's actually not just the pipes but the name itself. For example, none of the following ID values for the mRNA line work:

rna-gnlt1
rna-gnl.t1
rna-gnl:g1

Only rna-gnl results in a successful execution given my tiny subset of your GFF. Baoxing will have to chime in. I am not familiar enough with the GFF spec to understand what AnchorWave is doing here.

Mar 29 '24 22:03 matthewwiese

OK yeah I do think it's a regex issue with the pipes | and other ignored characters.

My hunch is that the CDS parsing of Parent=rna-gnl|WGS:JAQQXO|g1.t1; breaks on the pipes etc, such that the parsed value is rna-gnl and the rest is lost. That fits with Baoxing's regex from here - online regex demo.

Edit: So to clarify, the previous three values from this comment actually do work. You just have to update the CDS entries to match. The regex finds them just fine, so long as there isn't a pipe.

Mar 29 '24 22:03 matthewwiese

Hello, I have a question for you. When I run the code anchorwave gff2seq -i SL5.0.gff3-r CDs.fa > Anchorwave.log 2>&1, there is no error but the generated cds file is empty. And my gff3 file and fasta file have the same chromosome names from Chr01 to Chr12 respectively

Jun 05 '24 09:06 Zhaoqiyue5

Thanks @matthewwiese for digging into this, finally getting back around to it. For the specific gff I reference above, the issue was fixed by editing the gff to replace the pipe character: sed 's/|WGS:JAQQXO|/_WGS:JAQQXO_/g' {input} > {output}

@baoxingsong I agree it would be helpful to include an error message for cases like this along the lines of "mRNA and CDS records in GFF contain unrecognized characters" or that links to the proper regex

Sep 27 '24 18:09 jgroh

AnchorWave AnchorWave copied to clipboard

Empty CDS file from gff2seq

AnchorWave
AnchorWave copied to clipboard