Excessive run time?

Open gwct opened this issue 6 years ago • 5 comments

Hi all,

I'm running your SV detection software on several ~40X primate genomes. The problem I'm encountering is excessive run times for lumpy (and the rest of the software associated with your pipelines, such as svtyper). Initial runs on a single individual never completed, so I broke up the runs by chromosome. However, each chromosome in a single individual still takes about 20 hours to finish with lumpy. From asking around I get the impression that this isn't normally the case. If that's true, do you have any explanation for why I might be seeing such excessive run times?

Thanks in advance. -Gregg Thomas

Feb 03 '19 19:02 gwct

I think that you are getting bogged down in low complexity regions. Do you have these annotated for your genomes?

Feb 04 '19 18:02 ryanlayer

Hi Ryan, Thanks for your response. I've tried providing a RepeatMasker BED file from my reference genome using the -x flag, and unfortunately it's still taking a long time. I've let it run for nearly 2 days and the output VCF file has only ~6000 lines written (out of about 25000 from running lumpy on chromosomes separately in this individual).

I should note that I'm using the version of lumpy packed with speedseq with the following command:

lumpyexpress -B $bamfile -S $splitfile -D $discfile -R $ref -o $ind-lumpy.vcf -x $excfile -P

I see no option for multi-threading in the documentation, but if I've overlooked that please let me know. The split-read and discordant-read files were generated by running samtools as per the lumpy README page. Ideally, I would run speedseq sv, however that is throwing an error for me, so I thought I'd try lumpy and svtyper separately.

Please let me know if there's anything else I should try! -Gregg
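
For anyone following along, the progress figure quoted above (about 6000 records written against roughly 25000 expected) can be checked with something like the sketch below. It is not part of lumpy or speedseq; the function name `vcf_progress` and the expected total are hypothetical and supplied by the caller, and the expected total may be an overestimate once regions are excluded with -x.

```shell
# Report how many variant records a VCF holds so far and what share of an
# expected total that represents. Hypothetical helper, not a lumpy tool.
vcf_progress() {
    vcf=$1        # path to the (possibly still growing) VCF
    expected=$2   # caller-supplied expected record count, must be > 0
    # Count every line that is not a header line (headers start with '#').
    written=$(grep -vc '^#' "$vcf")
    awk -v w="$written" -v e="$expected" \
        'BEGIN { printf "%d of %d records (%.1f%%)\n", w, e, 100 * w / e }'
}

# Usage (file name is a placeholder):
#   vcf_progress sample-lumpy.vcf 25000
```

Because it only counts non-header lines, it can be run repeatedly while lumpy is still writing the file.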

Feb 08 '19 15:02 gwct

Hmm, it looks like I may have spoken 2 hours too soon... lumpy finished on this individual in 47 hours, though with many fewer calls than my previous estimate. I suppose that is to be expected given that I'm excluding many parts of the genome now.

2 days is acceptable considering I can run all individuals in parallel. Is that about what you would expect for this type of data?

Again, thanks for your help. -Gregg
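
To put a number on "excluding many parts of the genome", one could total the bases covered by the exclude BED file. This is a minimal sketch, assuming a standard BED layout (chrom, start, end in the first three columns); the function name `bed_total_bp` is made up for illustration and is not a lumpy or bedtools command.

```shell
# Sum the lengths of all intervals in a BED file.
# Overlapping intervals are counted more than once; merge them first
# if an exact footprint is needed.
bed_total_bp() {
    awk '
        /^(track|browser|#)/ { next }   # skip optional BED header lines
        NF >= 3 { total += $3 - $2 }    # BED is 0-based, half-open: length = end - start
        END { printf "%.0f\n", total }  # %.0f avoids integer overflow in some awks
    ' "$1"
}

# Usage (file name is a placeholder):
#   bed_total_bp repeatmasker.bed
```

Dividing the result by the reference genome length gives a rough fraction of the genome that lumpy is being told to skip.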

Feb 08 '19 18:02 gwct

I am not really sure what I expect here, but it is good news that excluding noisy regions reduces the number of calls.

What % of your calls are BNDs between contigs?

Feb 13 '19 02:02 ryanlayer

Looks like a little less than half are BNDs... 2638 out of 5741. However, not all of these appear to be between chromosomes (if I'm understanding the notation correctly).

For example,

chr1 12125047 34_1 N [chr1:12147030[N

Of those 2638, only 1138 are between chromosomes, i.e.:

chr1 60128593 3839_1 N N]chr2:33919357]

Hopefully that is a normal enough proportion. If there's no other way to speed it up you can go ahead and close this issue. Thanks! -Gregg
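
A tally like the one above can be reproduced with a short awk filter. This is a sketch under a few assumptions: the VCF uses the standard column order (CHROM in column 1, ALT in column 5), breakend ALTs carry the mate position in square brackets as in the two example records, and chromosome names contain no brackets. The function name `count_bnd_types` is hypothetical.

```shell
# Count breakend (BND) records by whether the mate position lies on the
# same chromosome (intra) or a different one (inter). Reads VCF on stdin.
count_bnd_types() {
    awk '
        /^#/ { next }                          # skip VCF header lines
        {
            alt = $5
            # Only breakend ALTs contain a bracketed mate position.
            if (index(alt, "[") == 0 && index(alt, "]") == 0) next
            gsub(/\[/, " ", alt)               # turn brackets into spaces so the
            gsub(/\]/, " ", alt)               # mate "chrom:pos" becomes its own token
            n = split(alt, parts, " ")
            mate = ""
            for (i = 1; i <= n; i++)
                if (parts[i] ~ /:[0-9]+$/) {   # token that looks like chrom:pos
                    mate = parts[i]
                    sub(/:[0-9]+$/, "", mate)  # keep the chromosome name only
                }
            if (mate == "") next
            if (mate == $1) intra++; else inter++
        }
        END { printf "intra=%d inter=%d\n", intra, inter }
    '
}

# Usage (file name is a placeholder):
#   count_bnd_types < sample-lumpy.vcf
```

Note that lumpy writes each breakend as a pair of records (IDs ending in _1 and _2), so these are record counts, not event counts.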

Feb 14 '19 19:02 gwct