
What information does AnchorWave require to parse the GFF file correctly?

Open diaspj opened this issue 1 year ago • 3 comments

Hi,

First, congratulations on developing AnchorWave; it seems a great addition to the comparative genomics toolkit!

I have several questions, most of them related to the "Empty CDS file from gff2seq #36" issue raised by nwespe (https://github.com/baoxingsong/AnchorWave/issues/36). In short: what information does AnchorWave require in order to parse a GFF file correctly?

As an example, I will use the "Zea_mays.AGPv4.34.gff3" file that you used in your paper (the records for the first gene are reproduced below).

1 wareLab chromosome 1 307041717 . . . ID=chromosome:1

1 gramene gene 44289 49837 . + . ID=gene:Zm00001d027230;biotype=protein_coding;gene_id=Zm00001d027230;logic_name=maker_gene
1 gramene mRNA 44289 49837 . + . ID=transcript:Zm00001d027230_T001;Parent=gene:Zm00001d027230;biotype=protein_coding;transcript_id=Zm00001d027230_T001
1 gramene five_prime_UTR 44289 44350 . + . Parent=transcript:Zm00001d027230_T001
1 gramene exon 44289 44947 . + . Parent=transcript:Zm00001d027230_T001;Name=IRS1_030385-RA.exon1;constitutive=1;ensembl_end_phase=0;ensembl_phase=-1;exon_id=IRS1_030385-RA.exon1;rank=1
1 gramene CDS 44351 44947 . + 0 ID=CDS:Zm00001d027230_T001;Parent=transcript:Zm00001d027230_T001;protein_id=Zm00001d027230_T001
1 gramene exon 45666 45803 . + . Parent=transcript:Zm00001d027230_T001;Name=IRS1_030385-RA.exon2;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=IRS1_030385-RA.exon2;rank=2
1 gramene CDS 45666 45803 . + 0 ID=CDS:Zm00001d027230_T001;Parent=transcript:Zm00001d027230_T001;protein_id=Zm00001d027230_T001
1 gramene exon 45888 46133 . + . Parent=transcript:Zm00001d027230_T001;Name=IRS1_030385-RA.exon3;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=IRS1_030385-RA.exon3;rank=3
1 gramene CDS 45888 46133 . + 0 ID=CDS:Zm00001d027230_T001;Parent=transcript:Zm00001d027230_T001;protein_id=Zm00001d027230_T001
1 gramene exon 46229 46342 . + . Parent=transcript:Zm00001d027230_T001;Name=IRS1_030385-RA.exon4;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=IRS1_030385-RA.exon4;rank=4
1 gramene CDS 46229 46342 . + 0 ID=CDS:Zm00001d027230_T001;Parent=transcript:Zm00001d027230_T001;protein_id=Zm00001d027230_T001
1 gramene exon 46451 46633 . + . Parent=transcript:Zm00001d027230_T001;Name=IRS1_030385-RA.exon5;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=IRS1_030385-RA.exon5;rank=5
1 gramene CDS 46451 46633 . + 0 ID=CDS:Zm00001d027230_T001;Parent=transcript:Zm00001d027230_T001;protein_id=Zm00001d027230_T001
1 gramene exon 47045 47262 . + . Parent=transcript:Zm00001d027230_T001;Name=IRS1_030385-RA.exon6;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=IRS1_030385-RA.exon6;rank=6
1 gramene CDS 47045 47262 . + 0 ID=CDS:Zm00001d027230_T001;Parent=transcript:Zm00001d027230_T001;protein_id=Zm00001d027230_T001
1 gramene CDS 47650 47995 . + 1 ID=CDS:Zm00001d027230_T001;Parent=transcript:Zm00001d027230_T001;protein_id=Zm00001d027230_T001
1 gramene exon 47650 48111 . + . Parent=transcript:Zm00001d027230_T001;Name=IRS1_030385-RA.exon7;constitutive=1;ensembl_end_phase=-1;ensembl_phase=2;exon_id=IRS1_030385-RA.exon7;rank=7
1 gramene three_prime_UTR 47996 48111 . + . Parent=transcript:Zm00001d027230_T001
1 gramene exon 48200 49247 . + . Parent=transcript:Zm00001d027230_T001;Name=IRS1_030385-RA.exon8;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=IRS1_030385-RA.exon8;rank=8
1 gramene three_prime_UTR 48200 49247 . + . Parent=transcript:Zm00001d027230_T001
1 gramene exon 49330 49837 . + . Parent=transcript:Zm00001d027230_T001;Name=IRS1_030385-RA.exon9;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=IRS1_030385-RA.exon9;rank=9
1 gramene three_prime_UTR 49330 49837 . + . Parent=transcript:Zm00001d027230_T001

1 gramene gene 50877 55716 . - . ID=gene:Zm00001d027231;biotype=protein_coding;gene_id=Zm00001d027231;logic_name=maker_gene
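
To make the structure of these records explicit, here is a minimal Python sketch that tallies the feature types in a GFF3 file and the type of feature each Parent attribute points to. It only describes what is in the file; it does not reproduce or assume anything about AnchorWave's own parser. The function names (parse_attributes, summarize_gff) are made up for this illustration, and the three example lines are abbreviated versions of the maize records above.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def summarize_gff(lines):
    """Return (feature type counts, counts of (child type, parent type) links)."""
    types = Counter()
    id_to_type = {}
    records = []
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue
        # Real GFF3 is tab-separated; fall back to whitespace for pasted text.
        if "\t" in line:
            cols = line.rstrip("\n").split("\t")
        else:
            cols = line.split(None, 8)
        if len(cols) < 9:
            continue
        ftype = cols[2]
        attrs = parse_attributes(cols[8])
        types[ftype] += 1
        if "ID" in attrs:
            id_to_type[attrs["ID"]] = ftype
        records.append((ftype, attrs.get("Parent")))

    links = Counter()
    for ftype, parent in records:
        if parent is None:
            links[(ftype, "<none>")] += 1
            continue
        for parent_id in parent.split(","):  # Parent may list several IDs
            links[(ftype, id_to_type.get(parent_id, "<missing>"))] += 1
    return types, links


# Abbreviated records from the maize example (attributes shortened).
example = [
    "1 gramene gene 44289 49837 . + . ID=gene:Zm00001d027230;biotype=protein_coding",
    "1 gramene mRNA 44289 49837 . + . ID=transcript:Zm00001d027230_T001;Parent=gene:Zm00001d027230",
    "1 gramene CDS 44351 44947 . + 0 ID=CDS:Zm00001d027230_T001;Parent=transcript:Zm00001d027230_T001",
]
types, links = summarize_gff(example)
for (child, parent), n in sorted(links.items()):
    print(child, "->", parent, n)
```

Running the same summary on an annotation that yields an empty cds.fa would show whether its CDS records have a Parent at all, and whether that Parent resolves to an mRNA that in turn resolves to a gene, as it does in the maize file.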

First question: Assuming that the CDS rows are necessary, is the information in the gene, exon, or mRNA rows also required?

Second question: To parse the GFF file, does AnchorWave require the CDS ID to have the form "ID=CDS:Zm00001d027230_T001", including the "CDS:" prefix?

Third question: If the gene row is also required, must the gene ID have the form "ID=gene:Zm00001d027230", including the "gene:" prefix?

Fourth question: I am applying your software to a number of strains of a single bacterial species whose genomes carry several rearrangements, inversions, duplications, insertions, etc. Bacterial genes do not have introns. Is there any impediment to using a GFF file in which genes are equivalent to CDSs? Is AnchorWave likely to raise an error at any point in its multistage pipeline?
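
For concreteness, the kind of workaround in question could look like the sketch below. It assumes (this is not confirmed anywhere in this thread) that a gene → mRNA → CDS hierarchy like the one in the maize file is what is needed, and it wraps every parentless CDS record in a synthetic gene and mRNA with the same coordinates. The function name add_gene_mrna_parents, the "gene:" and "transcript:" ID prefixes, and the sample record are all hypothetical choices made for this illustration.

```python
def add_gene_mrna_parents(lines):
    """Return GFF3 lines in which each parentless CDS gets a synthetic
    gene and mRNA record spanning the same coordinates.

    Assumption: the input is tab-separated GFF3 and each CDS has an ID.
    Everything after a '##FASTA' directive is dropped.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            out.append(line)  # pass comments and other features through
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if "Parent" in attrs or "ID" not in attrs:
            out.append(line)  # already linked, or nothing to build IDs from
            continue
        cds_id = attrs["ID"]
        gene_id = "gene:" + cds_id        # hypothetical naming scheme
        mrna_id = "transcript:" + cds_id  # hypothetical naming scheme
        # gene and mRNA reuse seqid, source, start, end, score, strand;
        # their phase column is '.' because phase applies only to CDS.
        gene = cols[:2] + ["gene"] + cols[3:7] + [".", "ID=" + gene_id]
        mrna = cols[:2] + ["mRNA"] + cols[3:7] + [
            ".", "ID=" + mrna_id + ";Parent=" + gene_id]
        cds = cols[:8] + [cols[8].rstrip(";") + ";Parent=" + mrna_id]
        out.extend("\t".join(record) for record in (gene, mrna, cds))
    return out


# Hypothetical standalone CDS record, not taken from any real annotation.
sample = ["contig_1\tannotator\tCDS\t100\t400\t.\t+\t0\tID=ABC_00001"]
converted = add_gene_mrna_parents(sample)
print("\n".join(converted))
```

Whether such a conversion is necessary, or sufficient, is exactly what the questions above are meant to settle.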

Best regards,

Paulo Dias

PS: Attached is a GFF file automatically generated by Prokka, the most widely used software for bacterial genome annotation. With this file, too, the cds.fa file could not be generated, as described in the "Empty CDS file from gff2seq #36" issue.

input_compatible_2.zip

diaspj avatar Oct 07 '22 17:10 diaspj