unimap icon indicating copy to clipboard operation
unimap copied to clipboard

Options to favor detection of several small deletions instead of single large ones

Open elcortegano opened this issue 4 years ago • 6 comments

unimap does a great job aligning complex regions (e.g. satellites) even using reads. We are using it to detect some variants at telomeric regions, using PacBio HiFi sequencing data.

We have evidence from one of these regions that there should be 2 separate deletions, corresponding to two separate groups of monomers from a satellite, each of them of different size. However, unimap finds one large deletion instead, close together to several variants that suggest mismapping. I assume this behavior comes from the default settings, that may favour large deletions instead of many small deletions. The program was run with unimap -a -x asm5 -x hifi --cs. I attach snapshot below for this region.

telomere_5_unimap2

I have been trying to tune settings to penalize nucleotide mismatchings and favour more than 1 deletion if convenient. However, I find some problems including:

  • -B 500 -O 1 -E 2: increased mismatch penalty and lowered cost for gaps. This results in a "Segmentation fault" error.
  • -B 500: Does not seem to solve the problem, the alignment remains the same.
  • -O 1 -E 2: This makes some improvements, and for some reads we observe that mismatches disappear in favour of deletions (snapshot below), but not for the majority of reads. Cannot make progress from here however, since using "-B > 6" will cause the segfault, lower values of "-B" won't change anything, and -O and -E are near their minimum values (cannot take values of zero).

telomere_5_unimap3

Wonder if there are other options that are worth exploring here. Any hints on how to deal with this situation? Thank you

elcortegano avatar Jun 01 '21 21:06 elcortegano

Minimap2/unimap won't work with a very large -B because internally every score must be within [-128,127]. I should have added a test for that. Nonetheless, the failure of -O1 -E2 -B7 is caused by something else that I have overlooked. Note that under this scoring, the optimal alignment will have no mismatches. You can turn any mismatch to two gaps for a better score:

ATCATCA     ATC-ATCA
ATCtTCA     ATCt-TCA

This is probably violating the SW formulation unimap/minimap2 is using. You need to make sure ({-O}+{-E})*2>{-B} at least.

In your case, you can try -A1 -B4 -O6 -E2, or -A1 -B4 -O4 -E1.

lh3 avatar Jun 02 '21 03:06 lh3

Unfortunately, these options does not solve the issue, and still the majority of reads show the large deletion + mismatches.

From what I understand, it may be possible to try something like -B100 -O1 -E50, only that this would need to be in a future release to allow B > 7? We are aware that this might return unrealistic alignments for most of the genome, but for this particular region allowing multiple deletions could be more appropriate.

Thank you for the hints and feedback,

elcortegano avatar Jun 02 '21 14:06 elcortegano

The requirement is ({-O}+{-E})*2>{-B}; otherwise the alignment largely doesn't make sense. B<=7 is not a requirement. asm5 uses -A1 -B19 -O39,81 -E2,1. If you have the sequences, I can give a try.

lh3 avatar Jun 02 '21 15:06 lh3

Thank you for the clarifications, and for offering giving it a look. I have been trying different combinations of parameters following the above rule, but with no success.

Shall I send you the files to an email address? apparently I cannot attach them here.

elcortegano avatar Jun 02 '21 18:06 elcortegano

Hello,

@elcortegano I would like to ask, how did you obtain such a nice view of the alignment? I guess using some software (?) and fetching it bam/sam format coming out of the unimap? Thank you for your response.

ilikvlad avatar Jul 18 '21 19:07 ilikvlad

Hi @ilikvlad , this is with IGV (https://software.broadinstitute.org/software/igv/download).

elcortegano avatar Jul 24 '21 15:07 elcortegano