AnchorWave icon indicating copy to clipboard operation
AnchorWave copied to clipboard

Matching sequences align as 2 indels in paf CIGAR string

Open JingaJenga opened this issue 1 year ago • 13 comments

Hi,

I'm observing an apparent bug in which AnchorWave produces paf output with a CIGAR string that reports perfectly matching sequences as containing a matching insertion and deletion.

For example, see the assembly fasta file MHC.diploid_assembly.fasta. (Note: all files here need to be gunzipped before being used.) It contains two haplotype sequences, which I align to each other using AnchorWave and the anchors file MHC.anchors.fa. There's one stretch of 190 bp where the sequences match perfectly (MHC.1:256393-256582 = MHC.2:255950-256139). But when I align them, I get the output file MHC.2_to_MHC.1.paf. Its CIGAR string in this position, instead of the expected 190=, is 21=8I8D161=. The 8-bp identical sequence that is marked as 8I8D is MHC.1:256414-256421 = MHC.2:255971-255978 = CAAACAAA.

I've observed this same behavior in some other cases too. The CIGAR string contains an unexpected NIND pattern, with N = an integer usually lower than 8.

I'm using anchorwave v1.2.1 and paftools.js version 2.26-r1175.

Thank you in advance for any help you can provide!

JingaJenga avatar Jan 18 '24 22:01 JingaJenga

Thanks so much for your detailed feedback. May I have all the commands from minimap2 mapping and your minimap2 version information, please?

baoxingsong avatar Jan 19 '24 03:01 baoxingsong

Sure! I'm using minimap version 2.24-r1122. Here's my minimap2 command: minimap2 -ax splice:hq -N10 --MD <target.fasta> <source.fasta>

Here's the output bam file that corresponds to the paf file I sent before: MHC.2_to_MHC.1.bam.gz. It contains the same weird CIGAR pattern of 21=8I8D161= instead of 190=.

JingaJenga avatar Jan 19 '24 03:01 JingaJenga

Thanks, but may I have you GFF(3) file please? I need to repeat your results to further debug the problem.

baoxingsong avatar Jan 22 '24 02:01 baoxingsong

Here's MHC.gff.gz. This is the gff file I used to derive the anchors fasta file I sent earlier (MHC.anchors.fa). It's a subset of the genome-wide gff file at https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz.

JingaJenga avatar Jan 22 '24 03:01 JingaJenga

Hi, I have not succeed to repeat your results. For example, there is sequence named as "LOC105379664" in MHC.anchors.fa, while the GFF file tells me it is not a coding gene. Could kindly share all your commands please?

baoxingsong avatar Jan 29 '24 14:01 baoxingsong

Is there a reason why I should not be including non-coding genes in my anchors fasta file? Right now I'm including the CDS sequence if it's available but also including genes that don't have CDS sequences.

I'm also unsure what exactly this has to do with the alignment problem. Shouldn't the alignment process produce clean CIGAR strings regardless of whether the sequence is part of a CDS?

JingaJenga avatar Jan 30 '24 04:01 JingaJenga

Our initial idea was the coding gene is more conserved than non-coding genes. But including non-coding genes should not matter a lot. If you did not use the anchowave gff2seq function to generate the MHC.anchors.fa file, it matters. AnchorWave ignores short CDS/exon records. If you tuned the parameter of ancho-wave gff2seq to generate the MHC.anchors.fa file, then should turn the alignment parameters to make sure it parses the GFF file in the same as gff2seq.

baoxingsong avatar Jan 30 '24 05:01 baoxingsong

Aha - thank you for that tip! You're absolutely right, I did not use anchorwave gff2seq to make my MHC.anchors.fa file. Instead I parsed the GFF manually. I did this because I'm analyzing a genomic subregion (MHC, ~5 Mbp) rather than a whole genome, so I needed to include as many anchors as possible, even non-coding genes.

As you can see from my MHC.anchors.fa file, the sequence for LOC105379664 is fairly long at 3,973 bp, but some of my other sequences are much shorter - one (MIR6833) is only 61 bp. I can imagine how that might cause problems. Do you have a recommendation for minimum anchor sequence length?

JingaJenga avatar Jan 31 '24 01:01 JingaJenga

In the AnchorWave pipeline, we used minimap2 to map the reference CDS to the reference and query genome. Any reference full-length CDS mapping to a reference genome position different from the GFF record would be excluded for collinear analysis. The minimap2 has a problem dealing with short CDS (a single CDS record, not the whole transcript). AnchorWave ignores those short (<20bp by default, tuned using the -m parameter) CDS records.

baoxingsong avatar Jan 31 '24 01:01 baoxingsong

Thanks, this is really helpful to know! I tried tweaking my script to generate anchors files, so that it ignores exons and CDS's of length <20bp. This caused some of the anchor sequences to become shorter. But it didn't remove any anchor sequences entirely, and the anchors file still caused the same alignment error I saw before (the 21=8I8D161= in the CIGAR string.)

Do you have a sense of what's causing the alignment error? Is it inside minimap2, perhaps?

JingaJenga avatar Jan 31 '24 21:01 JingaJenga

I would need to repeat your results so I can help with debugging.

baoxingsong avatar Feb 01 '24 03:02 baoxingsong

Which results are you trying to repeat? The creation of the anchors fasta file from the GFF, or the creation of the alignment paf file using anchorwave genoAli?

JingaJenga avatar Feb 01 '24 04:02 JingaJenga

I need every command, including data downloading, to repeat your results. The anchorwave genoAli also needs the GFF file as input. I could not understand why you would like to write a script to generate the CDS.fa file. What kind of GFF was used as an input for anchorwave genoAli, then? You should use the identical GFF and genome file as input for anchorwave gff2seq and anchorwave genoAli, and parse GFF file using identical parameters.

baoxingsong avatar Feb 01 '24 09:02 baoxingsong