AnchorWave
AnchorWave copied to clipboard
Matching sequences align as 2 indels in paf CIGAR string
Hi,
I'm observing an apparent bug in which AnchorWave produces paf output with a CIGAR string that reports perfectly matching sequences as containing a matching insertion and deletion.
For example, see the assembly fasta file MHC.diploid_assembly.fasta. (Note: all files here need to be gunzipped before being used.) It contains two haplotype sequences, which I align to each other using AnchorWave and the anchors file MHC.anchors.fa. There's one stretch of 190 bp where the sequences match perfectly (MHC.1:256393-256582 = MHC.2:255950-256139). But when I align them, I get the output file MHC.2_to_MHC.1.paf. Its CIGAR string in this position, instead of the expected 190=
, is 21=8I8D161=
. The 8-bp identical sequence that is marked as 8I8D
is MHC.1:256414-256421 = MHC.2:255971-255978 = CAAACAAA
.
I've observed this same behavior in some other cases too. The CIGAR string contains an unexpected NIND
pattern, with N
= an integer usually lower than 8.
I'm using anchorwave v1.2.1 and paftools.js version 2.26-r1175.
Thank you in advance for any help you can provide!
Thanks so much for your detailed feedback. May I have all the commands from minimap2 mapping and your minimap2 version information, please?
Sure! I'm using minimap version 2.24-r1122. Here's my minimap2 command: minimap2 -ax splice:hq -N10 --MD <target.fasta> <source.fasta>
Here's the output bam file that corresponds to the paf file I sent before: MHC.2_to_MHC.1.bam.gz. It contains the same weird CIGAR pattern of 21=8I8D161=
instead of 190=
.
Thanks, but may I have you GFF(3) file please? I need to repeat your results to further debug the problem.
Here's MHC.gff.gz. This is the gff file I used to derive the anchors fasta file I sent earlier (MHC.anchors.fa). It's a subset of the genome-wide gff file at https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz.
Hi, I have not succeed to repeat your results. For example, there is sequence named as "LOC105379664" in MHC.anchors.fa, while the GFF file tells me it is not a coding gene. Could kindly share all your commands please?
Is there a reason why I should not be including non-coding genes in my anchors fasta file? Right now I'm including the CDS sequence if it's available but also including genes that don't have CDS sequences.
I'm also unsure what exactly this has to do with the alignment problem. Shouldn't the alignment process produce clean CIGAR strings regardless of whether the sequence is part of a CDS?
Our initial idea was the coding gene is more conserved than non-coding genes. But including non-coding genes should not matter a lot.
If you did not use the anchowave gff2seq
function to generate the MHC.anchors.fa
file, it matters. AnchorWave ignores short CDS/exon records.
If you tuned the parameter of ancho-wave gff2seq
to generate the MHC.anchors.fa
file, then should turn the alignment parameters to make sure it parses the GFF file in the same as gff2seq.
Aha - thank you for that tip! You're absolutely right, I did not use anchorwave gff2seq
to make my MHC.anchors.fa
file. Instead I parsed the GFF manually. I did this because I'm analyzing a genomic subregion (MHC, ~5 Mbp) rather than a whole genome, so I needed to include as many anchors as possible, even non-coding genes.
As you can see from my MHC.anchors.fa
file, the sequence for LOC105379664 is fairly long at 3,973 bp, but some of my other sequences are much shorter - one (MIR6833) is only 61 bp. I can imagine how that might cause problems. Do you have a recommendation for minimum anchor sequence length?
In the AnchorWave pipeline, we used minimap2 to map the reference CDS to the reference and query genome. Any reference full-length CDS mapping to a reference genome position different from the GFF record would be excluded for collinear analysis. The minimap2 has a problem dealing with short CDS (a single CDS record, not the whole transcript). AnchorWave ignores those short (<20bp by default, tuned using the -m parameter) CDS records.
Thanks, this is really helpful to know! I tried tweaking my script to generate anchors files, so that it ignores exons and CDS's of length <20bp. This caused some of the anchor sequences to become shorter. But it didn't remove any anchor sequences entirely, and the anchors file still caused the same alignment error I saw before (the 21=8I8D161=
in the CIGAR string.)
Do you have a sense of what's causing the alignment error? Is it inside minimap2, perhaps?
I would need to repeat your results so I can help with debugging.
Which results are you trying to repeat? The creation of the anchors fasta file from the GFF, or the creation of the alignment paf file using anchorwave genoAli
?
I need every command, including data downloading, to repeat your results. The anchorwave genoAli
also needs the GFF file as input. I could not understand why you would like to write a script to generate the CDS.fa file.
What kind of GFF was used as an input for anchorwave genoAli
, then?
You should use the identical GFF and genome file as input for anchorwave gff2seq
and anchorwave genoAli
, and parse GFF file using identical parameters.