Make DepthOfCoverage multi-threaded

Open Z-Zen opened this issue 2 years ago • 3 comments

Feature request

Tool(s) or class(es) involved

Tool/class name(s), special parameters? DepthOfCoverage

Description

Are there plans to make DepthOfCoverage multi-threaded? If not, would it be possible to request such an improvement?


Z-Zen avatar Jun 10 '22 12:06 Z-Zen

It is a feature we would have loved, but alas it isn't available. We also relied on -ct to get the percentage of bases at or above a given coverage (e.g. 20x), which has now been dropped in GATK 4+ versions.

We came across another tool called mosdepth. Compared to DepthOfCoverage:

  • It uses multithreading (albeit only for BAM decompression, so there are no performance gains beyond 4 threads).
  • It reports coverage for an exome within 5 minutes, and is even faster when the per-base coverage output is not needed.
  • Per-base coverage output can be skipped using -n (--no-per-base), and the remaining output closely matches that of DepthOfCoverage. Keep in mind that DepthOfCoverage also supports skipping it via the parameter --omitDepthOutputAtEachBase, which saves massively on I/O and cuts processing time from 50 minutes per sample to 40 minutes per sample.
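
For anyone scripting this, here is a minimal Python sketch of how such a mosdepth call could be assembled. The helper name and the file names (sample.bam, targets.bed, output) are hypothetical. The flags are the ones listed in mosdepth's help text as far as I know: --threads, --no-per-base to skip the per-base file, and --by for a BED of target regions. Please check `mosdepth --help` for your version before relying on it.

```python
def mosdepth_command(bam, prefix, targets_bed=None, threads=4, per_base=False):
    """Build an argv list for a mosdepth run.

    Hypothetical helper for illustration; it only assembles the command
    and does not run anything.
    """
    cmd = ["mosdepth", "--threads", str(threads)]
    if not per_base:
        # Skip the per-base output file to save I/O.
        cmd.append("--no-per-base")
    if targets_bed is not None:
        # Restrict region-level output to the capture targets.
        cmd += ["--by", targets_bed]
    # mosdepth expects the output prefix first, then the BAM/CRAM.
    cmd += [prefix, bam]
    return cmd


if __name__ == "__main__":
    # File names below are placeholders, not real data.
    print(" ".join(mosdepth_command("sample.bam", "output", targets_bed="targets.bed")))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the paths point at real files.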

If you do decide to give it a try, we have some tips and suggestions:

  • The tool generates multiple output files. If you are looking for total coverage, check the last line of the file output.mosdepth.summary.txt.
  • If you are looking for the percentage of bases covered at a target read depth, this information is in the file output.mosdepth.region.dist.txt. If your target read depth is 20x, you can search this file with grep -P "total\t20\t"; the third column should be the percentage (with only one decimal).
  • With the -d4 switch, they claim the granularity of the above percentage increases to 4 decimal places.
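
The grep lookup above can also be done in a few lines of Python, which is handy when looping over many samples. This is only a sketch: it assumes the tab-separated three-column layout described above (label, depth, value), the function name is made up, and the sample rows are invented for illustration. Whether the third column is a fraction or a percentage depends on mosdepth's output, so the value is returned unchanged.

```python
def value_at_depth(lines, depth, label="total"):
    """Return column 3 of the row `label<TAB>depth<TAB>value`, or None if absent.

    Hypothetical helper; `lines` can be an open file handle or a list of strings.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == label and fields[1] == str(depth):
            return float(fields[2])
    return None


if __name__ == "__main__":
    # Invented rows in the assumed layout, not real results.
    sample = ["total\t21\t0.95\n", "total\t20\t0.96\n", "chr1\t20\t0.97\n"]
    print(value_at_depth(sample, 20))
```

With a real run you would pass the open file instead of the list, e.g. `with open("output.mosdepth.region.dist.txt") as fh: value_at_depth(fh, 20)`.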

kvn95ss avatar Jun 13 '22 05:06 kvn95ss

@kvn95ss Thank you for your reply!

I was puzzled by one of your sentences stating that -ct is not available in GATK4+. However, isn't the parameter --summary-coverage-threshold in GATK4 supposed to be its equivalent?

Thanks!

Z-Zen avatar Jun 13 '22 14:06 Z-Zen

@Z-Zen woah, it is indeed equivalent. I had come across a post where it was mentioned that -ct is not supported. While the reply did ask the OP to read the documentation, there was no indication that the parameter had been replaced.

I tried it with the latest gatk (4.2.6.1) with a single -ct 20 and --omit-depth-output-at-each-base to speed things up. It took 25 minutes, which is indeed quite a bit faster than the older gatk3 version.
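
For reference, a rough Python sketch of what such a GATK4 invocation might look like when assembled in a script. The helper name and all file names are hypothetical. It uses the long parameter names mentioned in this thread (--summary-coverage-threshold, --omit-depth-output-at-each-base) plus the usual -R/-I/-L/-O arguments, so please verify them against the DepthOfCoverage documentation for your GATK version.

```python
def depth_of_coverage_command(bam, reference, intervals, out_prefix,
                              thresholds=(20,), per_base=False):
    """Build an argv list for GATK4 DepthOfCoverage.

    Hypothetical helper for illustration; it only assembles the command.
    """
    cmd = ["gatk", "DepthOfCoverage",
           "-R", reference, "-I", bam, "-L", intervals, "-O", out_prefix]
    for threshold in thresholds:
        # One flag per threshold, e.g. 20 for "bases covered at 20x or more".
        cmd += ["--summary-coverage-threshold", str(threshold)]
    if not per_base:
        # Skip the per-base table to cut I/O and run time.
        cmd.append("--omit-depth-output-at-each-base")
    return cmd


if __name__ == "__main__":
    # File names below are placeholders, not real data.
    print(" ".join(depth_of_coverage_command(
        "sample.bam", "reference.fasta", "targets.interval_list", "sample_doc")))
```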

kvn95ss avatar Jun 14 '22 04:06 kvn95ss