AnchorWave icon indicating copy to clipboard operation
AnchorWave copied to clipboard

Anchorwave large chromosome error

Open diego-rt opened this issue 2 years ago • 5 comments

Hello,

We are currently trying to run Anchorwave on the huge Axolotl genome (32 Gbp) and we are getting the following error:

Command:

anchorwave proali -t 19 -i GCF_000001405.39_GRCh38.p13_genomic.gff -as human_cds.fa -r GCF_000001405.39_GRCh38.p13_genomic.fna -ar human.sam -a axolotl.sam -s AmexG_v6_scaffolds_withMT-AJ584639_withUnplacedContigs.fa -y 0.4 -n hum_axo_anchors -R 1 -Q 1 -o human_vs_axolotl_alignment.maf -f human_vs_axolotl_alignment.f.maf

Error:

  SSE4.1 is enabled
  anchorwave: /Apps/anchorwave/src/service/TransferGffWithNucmerResult.cpp:427: void readSam(std::vector<AlignmentMatch>&, std::ifstream&, std::map<std::__cxx11::basic_string<char>, Transcript>&, int&, const double&, double&, std::set<std::__cxx11::basic_string<char> >&, int32_t&, int32_t&, int32_t&, int32_t&, int&, bool&, int&): Assertion `chromosomePosition==transcriptHashMap[elems[0]].getPEnd()' failed.
  .command.sh: line 2: 59509 Aborted                 anchorwave proali -t 19 -i GCF_000001405.39_GRCh38.p13_genomic.gff -as human_cds.fa -r GCF_000001405.39_GRCh38.p13_genomic.fna -ar human.sam -a axolotl.sam -s AmexG_v6_scaffolds_withMT-AJ584639_withUnplacedContigs.fa -y 0.4 -n hum_axo_anchors -R 1 -Q 1 -o human_vs_axolotl_alignment.maf -f human_vs_axolotl_alignment.f.maf

We suspect this is due to the chromosome coordinates exceeding the size of the declared variables, which is a common problem with this genome. Is there any chance you could fix this? We would be happy to help with any testing or anything that would help solve this issue!

Thank you!

diego-rt avatar May 09 '22 10:05 diego-rt

Could you please share where could I found those input files please?

baoxingsong avatar May 12 '22 00:05 baoxingsong

Sure! We are uploading the files and will send them to you by the end of the week. Thanks!!

diego-rt avatar May 17 '22 14:05 diego-rt

Similar error but different cause. This assertion can also fail when there is a slight overlap in the CDS coordinates such as in this example from GCF_000001405.26_GRCh38_genomic.gff:

790294:NC_000007.14     BestRefSeq      CDS     94663334        94664513        .       +       1       ID=cds26285;Parent=rna33791;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;Name=NP_001165908.1;Note=isoform 3 is encoded by transcript variant 2;exception=ribosomal slippage;gbkey=CDS;gene=PEG10;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1
790295:NC_000007.14     BestRefSeq      CDS     94664513        94665682        .       +       0       ID=cds26285;Parent=rna33791;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;Name=NP_001165908.1;Note=isoform 3 is encoded by transcript variant 2;exception=ribosomal slippage;gbkey=CDS;gene=PEG10;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1

This overlap of 1bp breaks your method of counting chromosome positions in TransferGffWithNucmerResult.cpp::readSam() -- both function signatures.

rmhubley avatar Jun 06 '24 20:06 rmhubley

Similar error but different cause. This assertion can also fail when there is a slight overlap in the CDS coordinates such as in this example from GCF_000001405.26_GRCh38_genomic.gff:

790294:NC_000007.14     BestRefSeq      CDS     94663334        94664513        .       +       1       ID=cds26285;Parent=rna33791;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;Name=NP_001165908.1;Note=isoform 3 is encoded by transcript variant 2;exception=ribosomal slippage;gbkey=CDS;gene=PEG10;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1
790295:NC_000007.14     BestRefSeq      CDS     94664513        94665682        .       +       0       ID=cds26285;Parent=rna33791;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;Name=NP_001165908.1;Note=isoform 3 is encoded by transcript variant 2;exception=ribosomal slippage;gbkey=CDS;gene=PEG10;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1

This overlap of 1bp breaks your method of counting chromosome positions in TransferGffWithNucmerResult.cpp::readSam() -- both function signatures.

Thanks. The above two lines indicate that the 94664513 bp belong to two CDSs at the same time, and the length of the intron is -1 bp.

From my point of view, this is a problem with the GFF file.

baoxingsong avatar Jun 07 '24 06:06 baoxingsong

Agreed, however the world is a rough and tumble place. Letting the user know the source of the issue (rather than a simple assert failure), would help them 1) identify the offending line(s) in the GFF file so they can remedy the situation, and 2) not incorrectly blame the issue on your tool. Or, preferably you could report the problem as a warning and have a sensible default remedy coded ( disregard any overlap and treat as blunt ends, or remove/disqualify the overlapping CDS), that would allow your tool to run more robustly in a larger workflow.

rmhubley avatar Jun 07 '24 20:06 rmhubley