apps-scripts icon indicating copy to clipboard operation
apps-scripts copied to clipboard

questions about ppbm2 option for repeat analysis tool

Open wenluo711 opened this issue 5 years ago • 1 comments

Hi, We are currently using the pacbio repeat analysis tool to analysis our targeted data on a repeat region. I'm wondering what does the -E 0 -L 0 mean. From pbmm2 manual itself, the explanation are very brief, -E,--gap-extend-2 Gap extension penalty 2. [-1] -L,--lj-min-ratio Long join flank ratio. [-1]

What it looks like is for repeat analysis, we are trying to reduce the gap extention penalty form 2 to 0? What is the difference between --gap-extend-2 and --gap-extend-1? -L in minmap is to move long CIGAR to CG tag. Is the -L option for ppbm2 same as minimap2? Can't figure out what does long join flank ratio mean from here...

When we open bams created from ccs with mapping with/without these options, they do seems different from IGV view, but the repeat analysis tool result are still same...

Thanks, Wen

wenluo711 avatar Sep 19 '19 15:09 wenluo711

Hi Wen,

The -E 0 parameter causes the gap-extension penalty to be a constant, rather than linearly increasing cost. Long expansions relative to the reference may not map through the repeat region if this penalty increases with length, instead the alignment gets clipped somewhere within the region, leaving only one side of the read mapped.

This is similar to the -L parameter in pbmm2, which specifies the --lj-min-ratio (in minimap2). This is the minimum fraction of query sequence that must be mapped to a gapped location (separate pieces) and still be kept as a 'good' mapping. We set this to a very low number so that reads with long extension regions relative to the reference and the length of the sequenced insert still get mapped. (-L [capitial] in minimap2 is used implicitly by pbmm2 https://github.com/PacificBiosciences/pbmm2#what-other-special-parameters-are-used-implicitly)

As you noticed, the extra mapping parameters make the reads map nicely (both sides of the repeat region, even when very long) and thus appear in IGV with an insertion icon (purple box with a number when long) as opposed to being soft-clipped on one side. However, as long as even one side of a read maps to the correct target region (e.g. with default mapping parameters), the python repeat analysis scripts will still find and extract the correct region. This is because the repeat analysis tools only use the full mapping to identify potentially spanning reads. To identify and extract the target region, the extraction script takes flanking sequence from the reference and maps them independently to each candidate read to find both boundaries of the target region.

Cheers, John

jrharting avatar Sep 19 '19 17:09 jrharting