AnchorWave
AnchorWave copied to clipboard
Anchorwave large chromosome error
Hello,
We are currently trying to run Anchorwave on the huge Axolotl genome (32 Gbp) and we are getting the following error:
Command:
anchorwave proali -t 19 -i GCF_000001405.39_GRCh38.p13_genomic.gff -as human_cds.fa -r GCF_000001405.39_GRCh38.p13_genomic.fna -ar human.sam -a axolotl.sam -s AmexG_v6_scaffolds_withMT-AJ584639_withUnplacedContigs.fa -y 0.4 -n hum_axo_anchors -R 1 -Q 1 -o human_vs_axolotl_alignment.maf -f human_vs_axolotl_alignment.f.maf
Error:
SSE4.1 is enabled
anchorwave: /Apps/anchorwave/src/service/TransferGffWithNucmerResult.cpp:427: void readSam(std::vector<AlignmentMatch>&, std::ifstream&, std::map<std::__cxx11::basic_string<char>, Transcript>&, int&, const double&, double&, std::set<std::__cxx11::basic_string<char> >&, int32_t&, int32_t&, int32_t&, int32_t&, int&, bool&, int&): Assertion `chromosomePosition==transcriptHashMap[elems[0]].getPEnd()' failed.
.command.sh: line 2: 59509 Aborted anchorwave proali -t 19 -i GCF_000001405.39_GRCh38.p13_genomic.gff -as human_cds.fa -r GCF_000001405.39_GRCh38.p13_genomic.fna -ar human.sam -a axolotl.sam -s AmexG_v6_scaffolds_withMT-AJ584639_withUnplacedContigs.fa -y 0.4 -n hum_axo_anchors -R 1 -Q 1 -o human_vs_axolotl_alignment.maf -f human_vs_axolotl_alignment.f.maf
We suspect this is due to the chromosome coordinates exceeding the size of the declared variables, which is a common problem with this genome. Is there any chance you could fix this? We would be happy to help with any testing or anything that would help solve this issue!
Thank you!
Could you please share where could I found those input files please?
Sure! We are uploading the files and will send them to you by the end of the week. Thanks!!
Similar error but different cause. This assertion can also fail when there is a slight overlap in the CDS coordinates such as in this example from GCF_000001405.26_GRCh38_genomic.gff:
790294:NC_000007.14 BestRefSeq CDS 94663334 94664513 . + 1 ID=cds26285;Parent=rna33791;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;Name=NP_001165908.1;Note=isoform 3 is encoded by transcript variant 2;exception=ribosomal slippage;gbkey=CDS;gene=PEG10;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1
790295:NC_000007.14 BestRefSeq CDS 94664513 94665682 . + 0 ID=cds26285;Parent=rna33791;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;Name=NP_001165908.1;Note=isoform 3 is encoded by transcript variant 2;exception=ribosomal slippage;gbkey=CDS;gene=PEG10;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1
This overlap of 1bp breaks your method of counting chromosome positions in TransferGffWithNucmerResult.cpp::readSam() -- both function signatures.
Similar error but different cause. This assertion can also fail when there is a slight overlap in the CDS coordinates such as in this example from GCF_000001405.26_GRCh38_genomic.gff:
790294:NC_000007.14 BestRefSeq CDS 94663334 94664513 . + 1 ID=cds26285;Parent=rna33791;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;Name=NP_001165908.1;Note=isoform 3 is encoded by transcript variant 2;exception=ribosomal slippage;gbkey=CDS;gene=PEG10;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1 790295:NC_000007.14 BestRefSeq CDS 94664513 94665682 . + 0 ID=cds26285;Parent=rna33791;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;Name=NP_001165908.1;Note=isoform 3 is encoded by transcript variant 2;exception=ribosomal slippage;gbkey=CDS;gene=PEG10;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1
This overlap of 1bp breaks your method of counting chromosome positions in TransferGffWithNucmerResult.cpp::readSam() -- both function signatures.
Thanks. The above two lines indicate that the 94664513 bp belong to two CDSs at the same time, and the length of the intron is -1 bp.
From my point of view, this is a problem with the GFF file.
Agreed, however the world is a rough and tumble place. Letting the user know the source of the issue (rather than a simple assert failure), would help them 1) identify the offending line(s) in the GFF file so they can remedy the situation, and 2) not incorrectly blame the issue on your tool. Or, preferably you could report the problem as a warning and have a sensible default remedy coded ( disregard any overlap and treat as blunt ends, or remove/disqualify the overlapping CDS), that would allow your tool to run more robustly in a larger workflow.