gatk icon indicating copy to clipboard operation
gatk copied to clipboard

GenomicsDBImport [Documentation/Usage]

Open nickhsmith opened this issue 3 years ago • 8 comments

Documentation request

I am trying to follow the germline joint variant best practices and am confused by the steps/examples and how it would work using real WGS or WXS data in a parallelized setting.

Tool(s) or class(es) involved

GenomicsDBImport v.4.2.6.1 (current)

Description

As far as I understand it the joint germline variant calling process is like this (imagine 100 samples):

  1. Call variants using Haplotypecaller using the gVCF output flag for each sample
  2. use the multiple gVCFs (1 per sample) and a set of intervals (WGS_intervals.bed as an example) to build a Genomics DB store using GenomicsDBImport
  3. Use GenotypeGVCFs using the output of GenomicsDBImport as the input to consolidate the multiple samples into 1 multi-sample vcf

My question comes from the parallelization/interval splitting during step 2. If I parallelize the GenomicsDBImport across each interval. I would end up with ~300 intervals and subsequently, ~300 GenomicsDB directory paths since I am not adding new samples to an existing DB, then the specified output DB path, "Must be an empty or non-existent directory", which will contain the relevant interval calls for the 100 samples.

Am I supposed to use the 300 directory paths as input into a single GenotypeGVCFs call? Or process each of the 300 intervals into 300 multi-sample vcf files (each with 100 samples) and then merge those into a single vcf file using GatherVcfs or some other merging tool.

The examples posted and documentation for GenomicsDBImport relay the need for intervals to work effectively, and so does an old broad lecture recording.

Essentially it boils down to when and how to process and merge the same set of samples (100) over the many intervals (300). If I had 300 compute nodes (as an example) I want to parallelize as much of this as possible. so that each node can process an interval set, and at the end of the process I have 1 VCF file with 100 samples covering the entire range of intervals.

I hope that was clear. Please let me know if you need any more info, or if I should be asking somewhere else. Thanks in advance!


nickhsmith avatar Jun 14 '22 07:06 nickhsmith

The output of GenomicsDBImport is a directory with a specific structure underneath. The required option here, https://gatk.broadinstitute.org/hc/en-us/articles/5358869876891-GenomicsDBImport#--genomicsdb-workspace-path, provides the "name" of this directory, but you really don't need to worry about the underlying structure. When you provide GenotypeGVCFs the variants, https://gatk.broadinstitute.org/hc/en-us/articles/5358906861083-GenotypeGVCFs#--variant, you're providing that directory. The gendb:// is the beginning of a URI that tells GATK how to read/interact with that structure just like the https:// at the beginning of a web URL.

The -L argument for GenomicsDBImport tells GATK which intervals to look at in the provided samples' data. You may need to adjust your --batch-size as you add more samples or use --consolidate. I've found that with large numbers of intervals and samples I need to use --merge-input-intervals or else the bookkeeping files stored in temp space exceed anything anything reasonable. HaplotypeCaller merges intervals by default, but GenotypeGVCFs does not.

In short, your --genomicsdb-workspace-path MY_DB should be the same as your -V gendb://MY_DB.

mjohnsonngi avatar Jun 16 '22 18:06 mjohnsonngi

Sorry that's not quite my question. Once I generate 2 directories through genomic import using 2 intervals because it was faster to parallelize them in 2 separate processes. Can I combine or merge these directories before the genotypegvcf step? Or do I use gathervcfs after that to merge the 2 vcf files?

nickhsmith avatar Jun 16 '22 20:06 nickhsmith

In that case, I believe you want GatherVcfs after GenotypeGVCFs if the intervals don't overlap. We typically keep them separate, but I've used GatherVcfs in the past with a large number of samples.

mjohnsonngi avatar Jun 16 '22 20:06 mjohnsonngi

@mlathara and/or @nalinigans , what are your current recommendations for parallelizing a GenomicsDB-based pipeline?

droazen avatar Jul 05 '22 19:07 droazen

The only parallelism exposed by GenomicsDBImport is via max-num-intervals-to-import-in-parallel and that is at the granularity of threads in the same process. Not sure there is any parallelism in GenotypeGVCFs. The granularity of parallelism that @nickhsmith seems to want is at the level of a node and @mjohnsonngi is correct. Assuming GenotypeGVCFs processes each position independently, run GenomicsDBImport followed by GenotypeGVFs per node and then use GatherVcfs.

nalinigans avatar Jul 05 '22 22:07 nalinigans

Hi, if I have 700 samples, can I use batch-size=700?

wichadak avatar Jan 23 '23 07:01 wichadak

Hi, if I have 700 samples, can I use batch-size=700?

You could, but using 0 for batch size has the same effect of not using batches. It's the default, so not including --batch-size should have the same effect.

mjohnsonngi avatar Jan 23 '23 15:01 mjohnsonngi