
How to speed up / debug deduplicating VCF step?

Open jbalberge opened this issue 3 years ago • 13 comments

Running svaba on Terra/Firecloud, we are having trouble at this step of the svaba run for >30X Tumor/Normal WGS (from the logs):

...vcf - reading in the breakpoints file
...vcf sizeof empty VCFEntryPair 64 bytes
...read in 2,325,969 indels and 1,718,531 SVs
...vcf - deduplicating 1,718,531 events
...dedupe at 0 of 1,718,531

Actual run time is a couple of hours for variant calling, then the logs get stuck at the dedupe step for 100+ hours and counting. Have you seen this before? Is there anything we can do to debug the situation?

We tried VMs with up to 128 GB RAM, 16 CPUs, and 1000 GB HDD, using the svaba 1.1.0 Quay docker image.

Thanks for your help!

jbalberge avatar Jun 04 '21 21:06 jbalberge

This is unusual behavior, but I do have a suggestion. There was an issue in the dedupe step that another user pointed out in issue #92, which I fixed in version 1.1.3.

Is there a contact person at the Broad Institute who is in charge of maintaining the svaba docker image? You could reach out to them to update svaba to the current version here on GitHub, 1.1.3.

walaj avatar Jun 04 '21 22:06 walaj

Thank you for the quick reply. I used the BioContainers docker image for 1.1.0, available at https://biocontainers.pro/tools/svaba. Unfortunately, upgrading the docker image to v1.1.3 didn't solve the problem. Could it be that the number of events is too high?

jbalberge avatar Jun 06 '21 17:06 jbalberge

jbalberge we are having the same issue, stuck for 100+ hours at this step for a couple of samples. Did you manage to fix the issue?

...vcf - reading in the breakpoints file
...vcf sizeof empty VCFEntryPair 64 bytes
...read in 1,104,990 indels and 1,739,919 SVs
...vcf - deduplicating 1,739,919 events
...dedupe at 0 of 1,739,919

The SvABA version we have been using is from some time ago. We have successfully processed hundreds of samples with this version, but now a couple of samples are just stuck. We could update the version and re-run just the problem samples, but we're not sure that would fix the issue, and then the cohort would no longer be "harmonised".

Program: SvABA
FH Version: 134
Contact: Jeremiah Wala [ [email protected] ]

The cohorts we've analysed have germlines sequenced at ~30x and tumors from 60x to 120x.

The two problem samples have germlines at ~30x and tumors at ~70x and ~120x. Both have been stuck at the dedupe step for 100+ hours. We have given the run 200 GB of memory.
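In case it helps to debug, a minimal way to check whether the stuck process is actually growing in memory or just spinning on CPU (assuming the process is named svaba; adjust the pgrep pattern if a wrapper renames it):

# list the running svaba process(es) with their full command line
pgrep -a svaba
# resident and virtual memory of the most recently started svaba process
grep -E 'VmRSS|VmSize' /proc/"$(pgrep -n svaba)"/status
# one-shot CPU/memory snapshot of the same process
top -b -n 1 -p "$(pgrep -n svaba)" | tail -n 2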

ahwanpandey avatar Jan 29 '24 20:01 ahwanpandey

@ahwanpandey This is one of the memory/runtime weaknesses of svaba that I've known about but haven't had time to fix. The issue is that svaba compiles all of the variants into an intermediate file, and this file needs to be sorted and de-duplicated at the end to produce the organized VCF. For most runs this is fine, but if the number of suspected variants is high (in your case it is very high), memory use can climb very high as it tries to read in this entire file.

The solution is really to just do what samtools sort does and use a scatter-gather sort, but I haven't been able to implement that yet.
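For illustration only, here is a rough sketch of the scatter-gather idea using GNU sort on the intermediate table (the file name, column numbers, and sort keys below are placeholders and do not reflect the real bps.txt.gz layout):

# scatter: split the body of the table into fixed-size chunks
zcat sample.bps.txt.gz | head -n 1 > header.txt
zcat sample.bps.txt.gz | tail -n +2 | split -l 500000 - chunk_
# sort each chunk independently (this step could run in parallel)
for f in chunk_*; do sort -k1,1 -k2,2n "$f" -o "$f.sorted"; done
# gather: merge the pre-sorted chunks without holding everything in memory
sort -m -k1,1 -k2,2n chunk_*.sorted | cat header.txt - | gzip > sorted.bps.txt.gz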

Out of curiosity, how large is the *.bps.txt.gz file for this run? That's the file that it is reading into memory.

walaj avatar Jan 31 '24 01:01 walaj

Hi @walaj thanks so much for your response. For the two samples that are stuck, the *.bps.txt.gz files are 147M and 131M.

We have a lot of high-grade ovarian cancer WGS data, and these samples do indeed carry a lot of structural variants. Is there any chance you would be able to fix this issue for us? I can share the files if that would be useful. We have run svaba on hundreds of samples over the years, and as you can understand it would be tricky not to be able to run the tool on a couple of samples, and probably more in the future. So again, we would be very grateful if you could look at fixing the issue when you get a chance.

The other option we are trying is to run the latest version of the tool. Do you think we will have the same problem with it?

I'm trying to install the latest version, but as you've noted I think I need to fix what CMake is doing. https://github.com/walaj/svaba/issues/132
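For reference, the build I'm attempting looks roughly like this (assuming the CMake setup discussed in #132 ends up working on our system; the exact steps and the final binary location may differ):

git clone --recursive https://github.com/walaj/svaba.git
cd svaba
mkdir build && cd build
cmake .. && make
# the binary location depends on the build setup; check the make output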

ahwanpandey avatar Jan 31 '24 01:01 ahwanpandey

If I remember correctly, this happened with short insert sizes; hard-trimming adapters and polyG tails must have reduced the number of candidates in my case at the time.
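A tool like fastp can do this kind of hard-trimming before alignment; this sketch is only illustrative, not the exact command used in my case, and the parameters should be tuned for your library:

fastp \
  -i tumor_R1.fastq.gz -I tumor_R2.fastq.gz \
  -o tumor_R1.trimmed.fastq.gz -O tumor_R2.trimmed.fastq.gz \
  --detect_adapter_for_pe \
  --trim_poly_g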


jbalberge avatar Jan 31 '24 01:01 jbalberge

Hmm, OK. Given that the bps.txt.gz files aren't that big, I'm concerned there is a memory clash happening somewhere that is running up the memory as a bug. There was a bug that caused random memory clashes on < 5% of samples at the dedupe stage, but I fixed it a while ago. I think our best approach here is to have you try the newly built version, and you'll just have a few samples that were run with a newer version. Nothing too substantive has changed, just bug fixes and build-system changes, so you wouldn't have to re-run your other samples.

If you're still getting the same memory overrun issues on the latest version for these samples, I'll have to re-visit the smart sorting. But with bps files that small, I doubt that this is the issue now.


walaj avatar Jan 31 '24 01:01 walaj

Hi @walaj I've now tried to re-run with the latest version and still got stuck at the dedupe step for two samples :/. Would it be possible for you to look into fixing this issue for us? I can share any files you need. We would be very grateful for your time in fixing this bug.

Stuck at the following step for two out of hundreds of WGS samples.

==> AN_T_65913_1600143_21_N_65913_GL/std_out_err_AN/WGS.SvABA.STAGE0.SvABA.AN_T_65913_1600143_21_N_65913_GL.new.17918181.papr-res-compute215.err <==
-----------------------------------------------------------
---  Running svaba SV and indel detection on 8 threads ----
---    (inspect *.log for real-time progress updates)   ---
-----------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
--- Loaded non-read data. Starting detection pipeline
...vcf - reading in the breakpoints file
...vcf sizeof empty VCFEntryPair 64 bytes
...read in 1,104,585 indels and 1,596,340 SVs
...vcf - deduplicating 1,596,340 events
...dedupe at 0 of 1,596,340

==> AN_T_66639_2100027_16_N_66639_GL/std_out_err_AN/WGS.SvABA.STAGE0.SvABA.AN_T_66639_2100027_16_N_66639_GL.new.17918182.papr-res-compute06.err <==
-----------------------------------------------------------
---  Running svaba SV and indel detection on 8 threads ----
---    (inspect *.log for real-time progress updates)   ---
-----------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
--- Loaded non-read data. Starting detection pipeline
...vcf - reading in the breakpoints file
...vcf sizeof empty VCFEntryPair 64 bytes
...read in 1,074,831 indels and 1,282,024 SVs
...vcf - deduplicating 1,282,024 events
...dedupe at 0 of 1,282,024

The output directory contents so far

[screenshot: output directory contents]

Latest SvABA version, where the issue persists:

------------------------------------------------------------
-------- SvABA - SV and indel detection by assembly --------
------------------------------------------------------------
Program: SvABA
Version: 1.1.3
Contact: Jeremiah Wala [ [email protected] ]
Usage: svaba <command> [options]

Commands:
           run            Run SvABA SV and Indel detection on BAM(s)
           refilter       Refilter the SvABA breakpoints with additional/different criteria to created filtered VCF and breakpoints file.

Report bugs to [email protected]

Old version, where the issue was first observed:

------------------------------------------------------------
--- SvABA (sah-bah) - SV and indel detection by assembly ---
------------------------------------------------------------
Program: SvABA
FH Version: 134
Contact: Jeremiah Wala [ [email protected] ]
Usage: svaba <command> [options]

Commands:
           run            Run SvABA SV and Indel detection on BAM(s)
           refilter       Refilter the SvABA breakpoints with additional/different criteria to created filtered VCF and breakpoints file.

Report bugs to [email protected]

ahwanpandey avatar Feb 03 '24 04:02 ahwanpandey

@walaj is there any chance you could have a look at this issue for us? We would be very grateful for the help. Thanks so much.

ahwanpandey avatar Feb 11 '24 21:02 ahwanpandey

This is fixed in the latest commit (d9f37dbc40ed783b5758389405113ac2a0dfbd82)

walaj avatar Mar 22 '24 14:03 walaj

@walaj Thanks for all the help so far.

I have now downloaded the latest commit and processed some old samples using both the old version (as mentioned in this issue) and the latest commit (fcfa17e). The results are drastically different in the number of passing somatic SVs. See the plot below, summarized per chromosome across two samples (latest-commit results in orange bars).

[plot: passing somatic SV counts per chromosome, old version vs latest commit]
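For reference, the per-chromosome counts of passing somatic SVs were tallied roughly like this (the VCF file name is illustrative; adjust to your svaba somatic SV output):

zcat sample.svaba.somatic.sv.vcf.gz \
  | awk -F'\t' '!/^#/ && $7=="PASS" {print $1}' \
  | sort | uniq -c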

I noticed that in the new commit's log file there are lots of messages saying "with limit hit of 0", whereas there are few in the old version. I'm not sure whether this is related. I also ran the new version with 16 threads instead of the 8 used with the old version; I'll re-run with 8 threads and see whether that changes anything. Do you have any ideas? Thanks again.

OLD VERSION

]$ cat AN_T_66639_2100027_14_N_66639_GL.log | grep "with limit hit of" | head -n 40
writing contigs etc on thread 140475294115584 with limit hit of 796
writing contigs etc on thread 140475302508288 with limit hit of 474
writing contigs etc on thread 140475260544768 with limit hit of 1353
writing contigs etc on thread 140475285722880 with limit hit of 2536
writing contigs etc on thread 140475277330176 with limit hit of 2743
writing contigs etc on thread 140469615314688 with limit hit of 3811
writing contigs etc on thread 140475294115584 with limit hit of 336
writing contigs etc on thread 140475268937472 with limit hit of 1780
writing contigs etc on thread 140475302508288 with limit hit of 307
writing contigs etc on thread 140475310900992 with limit hit of 1795
writing contigs etc on thread 140475285722880 with limit hit of 552
writing contigs etc on thread 140475277330176 with limit hit of 916
writing contigs etc on thread 140475302508288 with limit hit of 574
writing contigs etc on thread 140475310900992 with limit hit of 437
writing contigs etc on thread 140475260544768 with limit hit of 1059
writing contigs etc on thread 140475285722880 with limit hit of 1293
writing contigs etc on thread 140475268937472 with limit hit of 2951
writing contigs etc on thread 140475294115584 with limit hit of 4241
writing contigs etc on thread 140469615314688 with limit hit of 5049
writing contigs etc on thread 140475302508288 with limit hit of 8076
writing contigs etc on thread 140475310900992 with limit hit of 4492
writing contigs etc on thread 140475277330176 with limit hit of 5499
writing contigs etc on thread 140475294115584 with limit hit of 6412
writing contigs etc on thread 140475268937472 with limit hit of 5956
writing contigs etc on thread 140475285722880 with limit hit of 16232
writing contigs etc on thread 140475260544768 with limit hit of 15423
writing contigs etc on thread 140469615314688 with limit hit of 7244
writing contigs etc on thread 140475302508288 with limit hit of 6837
writing contigs etc on thread 140475310900992 with limit hit of 8440
writing contigs etc on thread 140475268937472 with limit hit of 8838
writing contigs etc on thread 140475260544768 with limit hit of 7990
writing contigs etc on thread 140475260544768 with limit hit of 7990
writing contigs etc on thread 140475260544768 with limit hit of 7990
writing contigs etc on thread 140475260544768 with limit hit of 7990
writing contigs etc on thread 140475294115584 with limit hit of 13428
writing contigs etc on thread 140475285722880 with limit hit of 8048
writing contigs etc on thread 140475277330176 with limit hit of 11336
writing contigs etc on thread 140469615314688 with limit hit of 7874
writing contigs etc on thread 140475310900992 with limit hit of 8119
writing contigs etc on thread 140475302508288 with limit hit of 8213

LATEST COMMIT

]$ cat AN_T_66639_2100027_14_N_66639_GL.log | grep "with limit hit of" | head -n 40
writing contigs etc on thread 139868850484992 with limit hit of 0
writing contigs etc on thread 139868858877696 with limit hit of 0
writing contigs etc on thread 139868884055808 with limit hit of 0
writing contigs etc on thread 139868833699584 with limit hit of 0
writing contigs etc on thread 139868825306880 with limit hit of 0
writing contigs etc on thread 139874457597696 with limit hit of 0
writing contigs etc on thread 139874564609792 with limit hit of 0
writing contigs etc on thread 139874581395200 with limit hit of 0
writing contigs etc on thread 139874491168512 with limit hit of 0
writing contigs etc on thread 139874465990400 with limit hit of 0
writing contigs etc on thread 139874573002496 with limit hit of 0
writing contigs etc on thread 139868867270400 with limit hit of 0
writing contigs etc on thread 139868858877696 with limit hit of 0
writing contigs etc on thread 139868875663104 with limit hit of 0
writing contigs etc on thread 139874482775808 with limit hit of 0
writing contigs etc on thread 139868842092288 with limit hit of 0
writing contigs etc on thread 139868833699584 with limit hit of 0
writing contigs etc on thread 139868884055808 with limit hit of 0
writing contigs etc on thread 139874474383104 with limit hit of 0
writing contigs etc on thread 139874564609792 with limit hit of 0
writing contigs etc on thread 139868825306880 with limit hit of 0
writing contigs etc on thread 139874457597696 with limit hit of 0
writing contigs etc on thread 139874491168512 with limit hit of 0
writing contigs etc on thread 139874581395200 with limit hit of 0
writing contigs etc on thread 139868850484992 with limit hit of 0
writing contigs etc on thread 139874573002496 with limit hit of 0
writing contigs etc on thread 139874465990400 with limit hit of 0
writing contigs etc on thread 139868884055808 with limit hit of 0
writing contigs etc on thread 139868858877696 with limit hit of 0
writing contigs etc on thread 139868867270400 with limit hit of 0
writing contigs etc on thread 139868825306880 with limit hit of 0
writing contigs etc on thread 139868842092288 with limit hit of 0
writing contigs etc on thread 139874482775808 with limit hit of 0
writing contigs etc on thread 139874474383104 with limit hit of 0
writing contigs etc on thread 139868875663104 with limit hit of 0
writing contigs etc on thread 139868833699584 with limit hit of 0
writing contigs etc on thread 139874457597696 with limit hit of 0
writing contigs etc on thread 139874491168512 with limit hit of 0
writing contigs etc on thread 139868850484992 with limit hit of 0
writing contigs etc on thread 139874564609792 with limit hit of 0

ahwanpandey avatar May 08 '24 05:05 ahwanpandey