gatk icon indicating copy to clipboard operation
gatk copied to clipboard

No non variant sites

Open KPS-Malpe opened this issue 3 years ago • 7 comments

I use GenomicsDBImport to generate a GenomicsDB workspace, the parameter --include-non-variant-sites of GenotypeGVCFs has no effect.No non-variant sites were obtained.And the number of folders under --genomicsdb-workspace-path is less than the number of lines in the bed file provided with the -L parameter.

KPS-Malpe avatar Jul 20 '22 01:07 KPS-Malpe

@nalinigans and/or @mlathara, any thoughts on this one?

droazen avatar Jul 25 '22 19:07 droazen

@KPS-Malpe Can you check your original GVCFs (ie., the inputs to GenomicsDBImport) to make sure the non-variant locations you're interested in are actually covered by GVCF records?

droazen avatar Jul 25 '22 19:07 droazen

There are a total of 1115 samples, I tested one of non variant sites, and the site can be seen in 33 GVCF. I tested using CombineGVCFs to generate cohort_all.g.vcf.gz, and then using the --include-non-variant-sites parameter of GenotypeGVCFs to get those non-variant sites. But the cycle is long, taking about six days.

KPS-Malpe avatar Jul 27 '22 02:07 KPS-Malpe

@KPS-Malpe To start it would be useful to know what commandline you used, what version of GATK you're using, basic configuration info like that. If you could include the output of the commands you ran that could also be pertinent.

It is interesting that not all the intervals you specified in your bed file show up in the Genomicsdb workspace. Along the lines of the question asked by @droazen, I presume you checked that the missing intervals do actually have data in the gvcfs?

Lastly, you could try running SelectVariants on the Genomicsdb workspace -- that should return the data that was ingested into the genomicsdb, i.e., matching the input gvcfs.

mlathara avatar Jul 27 '22 02:07 mlathara

The command line I use is as /opt/reseq_softwares/gatk-4.1.8.0/gatk GenomicsDBImport --java-options "-Xmx100g -Xms100g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" -R /mnt/nvme1/reference/Gallus_gallus/Ensembl_g6a/new_20220624/Gallus_gallus.GRCg6a.dna.toplevel.fa --sample-name-map samplelist -L chr33.bed --genomicsdb-workspace-path chr33.db.The output log file is as follows,Using GATK jar /mnt/nvme1/opt/reseq_softwares/gatk-4.1.8.0/gatk-package-4.1.8.0-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx100g -Xms100g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /mnt/nvme1/opt/reseq_softwares/gatk-4.1.8.0/gatk-package-4.1.8.0-local.jar GenomicsDBImport -R /mnt/nvme1/reference/Gallus_gallus/Ensembl_g6a/Gallus_gallus.GRCg6a.dna.toplevel.fa --sample-name-map samplelist -L chr33.bed --genomicsdb-workspace-path chr33.db 11:19:39.692 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/nvme1/opt/reseq_softwares/gatk-4.1.8.0/gatk-package-4.1.8.0-local.jar!/com/intel/gkl/native/libgkl_compression.so Jul 04, 2022 11:19:40 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine INFO: Failed to detect whether we are running on Google Compute Engine. 11:19:40.099 INFO GenomicsDBImport - ------------------------------------------------------------ 11:19:40.099 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.8.0 11:19:40.100 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/ 11:19:40.100 INFO GenomicsDBImport - Executing as maosong@nygpu on Linux v3.10.0-1160.45.1.el7.x86_64 amd64 11:19:40.100 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_312-b07 11:19:40.100 INFO GenomicsDBImport - Start Date/Time: 2022年7月4日 上午11时19分39秒 11:19:40.100 INFO GenomicsDBImport - ------------------------------------------------------------ 11:19:40.100 INFO GenomicsDBImport - ------------------------------------------------------------ 11:19:40.101 INFO GenomicsDBImport - HTSJDK Version: 2.22.0 11:19:40.101 INFO GenomicsDBImport - Picard Version: 2.22.8 11:19:40.101 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2 11:19:40.101 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 11:19:40.101 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 11:19:40.101 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 11:19:40.101 INFO GenomicsDBImport - Deflater: IntelDeflater 11:19:40.101 INFO GenomicsDBImport - Inflater: IntelInflater 11:19:40.101 INFO GenomicsDBImport - GCS max retries/reopens: 20 11:19:40.102 INFO GenomicsDBImport - Requester pays: disabled 11:19:40.102 INFO GenomicsDBImport - Initializing engine 11:19:40.385 INFO FeatureManager - Using codec BEDCodec to read file file:///mnt/data/project/reseq/KPSNY042021067K/result/03.bwa_dup_gvcf/geno/chr33.bed 11:19:40.390 INFO IntervalArgumentCollection - Processing 10664 bp from intervals 11:19:40.391 WARN GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument. 11:19:40.429 INFO GenomicsDBImport - Done initializing engine 11:19:40.624 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.0-e701905 11:19:40.625 INFO GenomicsDBImport - Vid Map JSON file will be written to /mnt/data/chr33.db/vidmap.json 11:19:40.625 INFO GenomicsDBImport - Callset Map JSON file will be written to /mnt/data/chr33.db/callset.json 11:19:40.625 INFO GenomicsDBImport - Complete VCF Header will be written to /mnt/data/chr33.db/vcfheader.vcf 11:19:40.625 INFO GenomicsDBImport - Importing to workspace - /mnt/data/chr33.db 11:19:40.625 INFO ProgressMeter - Starting traversal 11:19:40.625 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute 11:19:49.073 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:20:12.073 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:20:32.582 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:20:50.914 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:21:10.429 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:21:29.249 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:21:49.299 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:22:08.978 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:22:27.150 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:22:48.914 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:23:09.891 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:23:28.327 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:23:50.244 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:24:03.967 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:24:17.835 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:24:33.800 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:24:48.075 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:25:05.568 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:25:20.209 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:25:38.770 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:25:57.203 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:26:12.596 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:26:30.878 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:26:51.708 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:27:13.007 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:27:31.225 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:27:49.459 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:28:10.355 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:28:31.754 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:28:53.847 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:29:05.657 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:29:22.003 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:29:36.560 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:29:53.492 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:30:10.730 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:30:27.552 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:30:46.076 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:31:03.520 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:31:22.888 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:31:40.408 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:31:51.527 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:32:03.966 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:32:17.200 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:32:30.755 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:32:45.859 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:33:00.257 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:33:16.484 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:33:35.165 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:33:53.423 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:34:14.046 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:34:34.866 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:34:47.528 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:34:59.818 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:35:15.632 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:35:29.116 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:35:45.619 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:36:00.056 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:36:14.941 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:36:30.363 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:36:48.662 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:37:07.844 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:37:24.658 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:37:42.177 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:38:02.359 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:38:23.596 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:38:36.811 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:38:50.940 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:39:04.480 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:39:21.126 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:39:35.436 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:39:52.798 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:40:07.567 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:40:24.563 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:40:44.364 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:41:05.096 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:41:25.756 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:41:46.973 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:42:10.176 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:42:33.851 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:42:57.796 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:43:12.612 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:43:27.221 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:43:44.904 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:44:02.541 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:44:19.709 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:44:36.905 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:44:54.929 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:45:13.453 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:45:34.036 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:45:53.203 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:46:13.277 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:46:34.771 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:46:57.089 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:47:17.825 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:47:41.662 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:48:06.188 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:48:23.634 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:48:39.044 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:48:55.406 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:49:15.190 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:49:33.519 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:49:52.668 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:50:10.842 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:50:30.431 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:50:52.070 INFO GenomicsDBImport - Importing batch 1 with 1115 samples 11:51:08.671 INFO ProgressMeter - 33:188091 31.5 1 0.0 11:51:08.672 INFO GenomicsDBImport - Done importing batch 1/1 11:51:08.673 INFO ProgressMeter - 33:188091 31.5 1 0.0 11:51:08.673 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 31.5 minutes. 11:51:08.673 INFO GenomicsDBImport - Import completed! 11:51:08.674 INFO GenomicsDBImport - Shutting down engine [2022年7月4日 上午11时51分08秒] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 31.48 minutes. Runtime.totalMemory()=103629717504 I counted 106 output folders, but chr33.bed has 109 sites. I use command as folloew /opt/reseq_softwares/gatk-4.1.8.0/gatk GenotypeGVCFs --java-options "-Xmx300g" -R /mnt/nvme1/reference/Gallus_gallus/Ensembl_g6a/Gallus_gallus.GRCg6a.dna.toplevel.fa -V gvcf/cohort_all.g.vcf.gz -O Haplocaller.region.raw.vcf --include-non-variant-sites -L designed.interval_list
The all sites are found.

KPS-Malpe avatar Jul 27 '22 03:07 KPS-Malpe

I use command as /opt/reseq_softwares/gatk-4.1.8.0/gatk SelectVariants --java-options "-Xmx100g -Xms100g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" -R /mnt/nvme1/reference/Gallus_gallus/Ensembl_g6a/Gallus_gallus.GRCg6a.dna.toplevel.fa -V gendb://chr33.db -L 33 -O output.chr33.vcf,then I get the 109 sites.Should I get VCF for each db separately and then merge VCF, instead of merging gendb together to call VCF?

KPS-Malpe avatar Jul 27 '22 03:07 KPS-Malpe

No - you should use GenomicsDB as the input to GenotypeGVCFs. My suggestion to use SelectVariants was only a debugging step to check whether GenomicsDB included the expected input data.

Also, I misread your original post -- your GVCFs have literally 109 sites? And the interval list/bed file has 109 sites, not 109 regions? I would strongly recommend using a single interval in that case. Having a subfolder for each site will lead to a lot of unnecessary overhead.

Lastly, I would recommend upgrading to a newer GATK. 4.1.8.0 is exactly 2 years old it seems...there's been quite a few changes/updates since then.

edit: and to resolve why the number of subfolders in the workspace don't match the lines in the bed file -- my guess is that some of the sites in the bed file are abutting each other. The default behavior in these cases is to merge these into a single interval. Look at --interval-merging-rule for more information

mlathara avatar Jul 27 '22 03:07 mlathara