Difference between lmerge and prune?

Open gwct opened this issue 5 years ago • 2 comments

Hello,

I wanted to get some clarifications about these two steps in the Tutorial that seem to be doing similar things. I notice that issue #12 points out that they are similar and that prune should be eliminated, but doesn't go into details. Apologies if I missed the explanation somewhere.

In the Tutorial, lmerge is described to

merge variant calls likely representing the same variant

Later on, prune is described to

filter out additional variant calls likely representing the same variant

It seems like lmerge is combining similar variants while prune is getting rid of them. Is that correct? Would it make sense to simply run lmerge again at the end instead of prune? If you have any other details about these two steps that you think are important that would be appreciated!

Thanks in advance. -Gregg Thomas

gwct avatar Apr 04 '19 14:04 gwct

Hi Gregg,

These are good questions (and apologies for the delay).

Issue #12 is quite old (we should probably remove it). At the time, we were concerned that there might have been a bug in lmerge that was allowing some very similar variant calls to persist and later ‘require’ pruning, but this did not turn out to be the case.

It’s true that lmerge and prune are similar in function, and it’s also true that lmerge is combining variants whereas pruning is removing them.

We use lmerge to merge the per-sample lumpy calls, when the goal is to merge and retain as much information as possible, as this will improve the genotyping accuracy. In lmerge, we merge the breakpoint distributions, sum the split-read and paired-read evidence, etc. Lmerge is necessary to combine variants across multiple samples.

Pruning is more of an optional ‘clean-up’ step. If, post-genotyping, two variants that remain are very close together, we often want to assume that they are the same (and perhaps ‘look’ slightly different due to the presence of a simple repeat, for instance). At that point, we simply want to pick the ‘best’ one (normally based on allele frequency) and remove the other(s). We don’t combine the breakpoint probability distributions or the evidence in this case.
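To make the distinction concrete, here is a minimal Python sketch of the two behaviors described above. It is a toy illustration only, not svtools code: the `Call` record, its fields, and both helper functions are hypothetical, and averaging positions is just a stand-in for the real merging of breakpoint probability distributions.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical, simplified SV call (not the svtools data model)."""
    pos: int
    split_reads: int
    paired_reads: int
    allele_freq: float = 0.0


def merge_calls(calls):
    """lmerge-like behavior: fold all calls into one record, keeping the evidence."""
    return Call(
        # Stand-in for merging breakpoint distributions: average the positions.
        pos=round(sum(c.pos for c in calls) / len(calls)),
        # Evidence counts are summed rather than discarded.
        split_reads=sum(c.split_reads for c in calls),
        paired_reads=sum(c.paired_reads for c in calls),
    )


def prune_calls(calls):
    """prune-like behavior: keep the single 'best' call, drop the others unchanged."""
    return max(calls, key=lambda c: c.allele_freq)


nearby = [
    Call(pos=1000, split_reads=3, paired_reads=5, allele_freq=0.10),
    Call(pos=1004, split_reads=4, paired_reads=2, allele_freq=0.25),
]
merged = merge_calls(nearby)   # one record carrying the combined evidence
kept = prune_calls(nearby)     # one of the original records, untouched
```

In the merged record the evidence from both calls is pooled, whereas the pruned result is simply the original call with the higher allele frequency.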

It’s true that you could do 2 rounds of lmerge instead of (or in addition to) prune. We sometimes do this for very large callsets but the workflow is correspondingly different.

Haley

abelhj avatar Apr 19 '19 02:04 abelhj

Hi Haley,

Thanks for the info, it's very useful. I went ahead and tried an lmerge as the last step instead of prune, but I can't seem to get it to work. I think this is ok -- I can just go ahead with prune as the last step, but I thought you might want to know what I was trying and what was going wrong.

If I do vcfpaste and then lmerge with -g, all the FORMAT fields for all samples are converted to ., including the genotype. This is much like the other issue I posted earlier (#277). I fixed that issue by following the Tutorial more carefully, but this problem seems similar.

If I instead try lsort and then lmerge I run into problems because the vcf files from the genotype step do not contain the lumpy probability curves. I'm not sure if there's a way to adjust the workflow like you said to retain these or not.
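For anyone wanting to check this on their own files, a rough Python sketch follows. It assumes the probability curves are stored in INFO keys named `PRPOS` and `PREND` (the names I believe LUMPY uses, but please verify against your own VCF header); the helper function is hypothetical and not part of svtools.

```python
def has_probability_curves(vcf_line, keys=("PRPOS", "PREND")):
    """Return True if a VCF record's INFO column contains all of the given keys.

    The default key names are an assumption; check the ##INFO header lines
    of your own VCF to confirm them.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if vcf_line.startswith("#") or len(fields) < 8:
        return False  # header line or malformed record
    # INFO is the 8th column; entries are separated by ';' and may be KEY=VALUE or flags.
    info_keys = {entry.split("=", 1)[0] for entry in fields[7].split(";")}
    return all(key in info_keys for key in keys)


# Two made-up records for illustration only.
with_curves = "1\t1000\t1\tN\t<DEL>\t.\t.\tSVTYPE=DEL;END=2000;PRPOS=0.5,0.5;PREND=0.5,0.5"
without_curves = "1\t1000\t1\tN\t<DEL>\t.\t.\tSVTYPE=DEL;END=2000"
```

Running the helper over every record of a VCF would show at which step of the workflow the curves disappear.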

But I'm ok moving forward with prune. I just wanted to let you know what was going on. Thanks again for your help!

-Gregg

gwct avatar Apr 22 '19 18:04 gwct