gatk
gatk copied to clipboard
No non variant sites
I use GenomicsDBImport to generate a GenomicsDB workspace, the parameter --include-non-variant-sites of GenotypeGVCFs has no effect.No non-variant sites were obtained.And the number of folders under --genomicsdb-workspace-path is less than the number of lines in the bed file provided with the -L parameter.
@nalinigans and/or @mlathara, any thoughts on this one?
@KPS-Malpe Can you check your original GVCFs (ie., the inputs to GenomicsDBImport) to make sure the non-variant locations you're interested in are actually covered by GVCF records?
There are a total of 1115 samples, I tested one of non variant sites, and the site can be seen in 33 GVCF. I tested using CombineGVCFs to generate cohort_all.g.vcf.gz, and then using the --include-non-variant-sites parameter of GenotypeGVCFs to get those non-variant sites. But the cycle is long, taking about six days.
@KPS-Malpe To start it would be useful to know what commandline you used, what version of GATK you're using, basic configuration info like that. If you could include the output of the commands you ran that could also be pertinent.
It is interesting that not all the intervals you specified in your bed file show up in the Genomicsdb workspace. Along the lines of the question asked by @droazen, I presume you checked that the missing intervals do actually have data in the gvcfs?
Lastly, you could try running SelectVariants on the Genomicsdb workspace -- that should return the data that was ingested into the genomicsdb, i.e., matching the input gvcfs.
The command line I use is as /opt/reseq_softwares/gatk-4.1.8.0/gatk GenomicsDBImport --java-options "-Xmx100g -Xms100g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" -R /mnt/nvme1/reference/Gallus_gallus/Ensembl_g6a/new_20220624/Gallus_gallus.GRCg6a.dna.toplevel.fa --sample-name-map samplelist -L chr33.bed --genomicsdb-workspace-path chr33.db.The output log file is as follows,Using GATK jar /mnt/nvme1/opt/reseq_softwares/gatk-4.1.8.0/gatk-package-4.1.8.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx100g -Xms100g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /mnt/nvme1/opt/reseq_softwares/gatk-4.1.8.0/gatk-package-4.1.8.0-local.jar GenomicsDBImport -R /mnt/nvme1/reference/Gallus_gallus/Ensembl_g6a/Gallus_gallus.GRCg6a.dna.toplevel.fa --sample-name-map samplelist -L chr33.bed --genomicsdb-workspace-path chr33.db
11:19:39.692 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/nvme1/opt/reseq_softwares/gatk-4.1.8.0/gatk-package-4.1.8.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jul 04, 2022 11:19:40 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
11:19:40.099 INFO GenomicsDBImport - ------------------------------------------------------------
11:19:40.099 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.8.0
11:19:40.100 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
11:19:40.100 INFO GenomicsDBImport - Executing as maosong@nygpu on Linux v3.10.0-1160.45.1.el7.x86_64 amd64
11:19:40.100 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_312-b07
11:19:40.100 INFO GenomicsDBImport - Start Date/Time: 2022年7月4日 上午11时19分39秒
11:19:40.100 INFO GenomicsDBImport - ------------------------------------------------------------
11:19:40.100 INFO GenomicsDBImport - ------------------------------------------------------------
11:19:40.101 INFO GenomicsDBImport - HTSJDK Version: 2.22.0
11:19:40.101 INFO GenomicsDBImport - Picard Version: 2.22.8
11:19:40.101 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:19:40.101 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:19:40.101 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:19:40.101 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:19:40.101 INFO GenomicsDBImport - Deflater: IntelDeflater
11:19:40.101 INFO GenomicsDBImport - Inflater: IntelInflater
11:19:40.101 INFO GenomicsDBImport - GCS max retries/reopens: 20
11:19:40.102 INFO GenomicsDBImport - Requester pays: disabled
11:19:40.102 INFO GenomicsDBImport - Initializing engine
11:19:40.385 INFO FeatureManager - Using codec BEDCodec to read file file:///mnt/data/project/reseq/KPSNY042021067K/result/03.bwa_dup_gvcf/geno/chr33.bed
11:19:40.390 INFO IntervalArgumentCollection - Processing 10664 bp from intervals
11:19:40.391 WARN GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
11:19:40.429 INFO GenomicsDBImport - Done initializing engine
11:19:40.624 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.0-e701905
11:19:40.625 INFO GenomicsDBImport - Vid Map JSON file will be written to /mnt/data/chr33.db/vidmap.json
11:19:40.625 INFO GenomicsDBImport - Callset Map JSON file will be written to /mnt/data/chr33.db/callset.json
11:19:40.625 INFO GenomicsDBImport - Complete VCF Header will be written to /mnt/data/chr33.db/vcfheader.vcf
11:19:40.625 INFO GenomicsDBImport - Importing to workspace - /mnt/data/chr33.db
11:19:40.625 INFO ProgressMeter - Starting traversal
11:19:40.625 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
11:19:49.073 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:20:12.073 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:20:32.582 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:20:50.914 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:21:10.429 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:21:29.249 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:21:49.299 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:22:08.978 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:22:27.150 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:22:48.914 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:23:09.891 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:23:28.327 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:23:50.244 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:24:03.967 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:24:17.835 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:24:33.800 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:24:48.075 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:25:05.568 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:25:20.209 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:25:38.770 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:25:57.203 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:26:12.596 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:26:30.878 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:26:51.708 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:27:13.007 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:27:31.225 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:27:49.459 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:28:10.355 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:28:31.754 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:28:53.847 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:29:05.657 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:29:22.003 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:29:36.560 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:29:53.492 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:30:10.730 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:30:27.552 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:30:46.076 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:31:03.520 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:31:22.888 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:31:40.408 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:31:51.527 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:32:03.966 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:32:17.200 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:32:30.755 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:32:45.859 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:33:00.257 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:33:16.484 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:33:35.165 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:33:53.423 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:34:14.046 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:34:34.866 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:34:47.528 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:34:59.818 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:35:15.632 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:35:29.116 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:35:45.619 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:36:00.056 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:36:14.941 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:36:30.363 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:36:48.662 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:37:07.844 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:37:24.658 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:37:42.177 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:38:02.359 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:38:23.596 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:38:36.811 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:38:50.940 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:39:04.480 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:39:21.126 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:39:35.436 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:39:52.798 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:40:07.567 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:40:24.563 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:40:44.364 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:41:05.096 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:41:25.756 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:41:46.973 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:42:10.176 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:42:33.851 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:42:57.796 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:43:12.612 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:43:27.221 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:43:44.904 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:44:02.541 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:44:19.709 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:44:36.905 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:44:54.929 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:45:13.453 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:45:34.036 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:45:53.203 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:46:13.277 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:46:34.771 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:46:57.089 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:47:17.825 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:47:41.662 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:48:06.188 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:48:23.634 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:48:39.044 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:48:55.406 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:49:15.190 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:49:33.519 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:49:52.668 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:50:10.842 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:50:30.431 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:50:52.070 INFO GenomicsDBImport - Importing batch 1 with 1115 samples
11:51:08.671 INFO ProgressMeter - 33:188091 31.5 1 0.0
11:51:08.672 INFO GenomicsDBImport - Done importing batch 1/1
11:51:08.673 INFO ProgressMeter - 33:188091 31.5 1 0.0
11:51:08.673 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 31.5 minutes.
11:51:08.673 INFO GenomicsDBImport - Import completed!
11:51:08.674 INFO GenomicsDBImport - Shutting down engine
[2022年7月4日 上午11时51分08秒] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 31.48 minutes.
Runtime.totalMemory()=103629717504
I counted 106 output folders, but chr33.bed has 109 sites.
I use command as folloew /opt/reseq_softwares/gatk-4.1.8.0/gatk GenotypeGVCFs --java-options "-Xmx300g" -R /mnt/nvme1/reference/Gallus_gallus/Ensembl_g6a/Gallus_gallus.GRCg6a.dna.toplevel.fa -V gvcf/cohort_all.g.vcf.gz -O Haplocaller.region.raw.vcf --include-non-variant-sites -L designed.interval_list
The all sites are found.
I use command as /opt/reseq_softwares/gatk-4.1.8.0/gatk SelectVariants --java-options "-Xmx100g -Xms100g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" -R /mnt/nvme1/reference/Gallus_gallus/Ensembl_g6a/Gallus_gallus.GRCg6a.dna.toplevel.fa -V gendb://chr33.db -L 33 -O output.chr33.vcf,then I get the 109 sites.Should I get VCF for each db separately and then merge VCF, instead of merging gendb together to call VCF?
No - you should use GenomicsDB as the input to GenotypeGVCFs. My suggestion to use SelectVariants was only a debugging step to check whether GenomicsDB included the expected input data.
Also, I misread your original post -- your GVCFs have literally 109 sites? And the interval list/bed file has 109 sites, not 109 regions? I would strongly recommend using a single interval in that case. Having a subfolder for each site will lead to a lot of unnecessary overhead.
Lastly, I would recommend upgrading to a newer GATK. 4.1.8.0 is exactly 2 years old it seems...there's been quite a few changes/updates since then.
edit: and to resolve why the number of subfolders in the workspace don't match the lines in the bed file -- my guess is that some of the sites in the bed file are abutting each other. The default behavior in these cases is to merge these into a single interval. Look at --interval-merging-rule for more information