gatk
gatk copied to clipboard
GenomicsDBImport sorting issue
Bug Report
Affected tool(s) or class(es)
GenomicsDBImport
Affected version(s)
Latest
Description
Output from GenomicsDBImport is sorting based on alphabetic order: chr 10 ahead of chr 7
Steps to reproduce
gatk --java-options "-Xmx16G -Xms8G" GenomicsDBImport --genomicsdb-workspace-path /workdir/db_{uid} --reference {genome} --intervals {subint} --sample-name-map sample_map --max-num-intervals-to-import-in-parallel 4 --merge-input-intervals --tmp-dir /tmpdir
Expected behavior
Chr 7 ahead of Chr10. (7 should be before 10)
Actual behavior
Chr 10 ahead of Chr 7. (1 is before 7)
@EsperanzaDai How are the intervals you're passing to GenomicsDBImport sorted? What tool are you using to read from the GenomicsDB and produce a VCF?
The intervals we are passing to GenomicsDBImport are sorted by chromosomal numbers for a subset of chromosomal regions: chr 7, 8, 9, 10.
The next step is CreateSomaticPanelOfNormals or GenotypeGVCFs depending on the pipelines. But the output for both steps all shows the problem that chr 10 is ahead of chr 7 (chr 10, 7, 8, and 9).
Thanks!
Revision: Affected version(s) 4.3.0.0
Re-tested with the latest version, the bug remained: Affected version(s) 4.4.0.0
@EsperanzaDai Can you try running SelectVariants on your GenomicsDB, and check whether you still get the contigs out-of-order.
@nalinigans / @mlathara , have either of you seen reports of GenomicsDB returning records in lexicographical order instead of sequence dictionary order?
GenomicsDBImport
itself doesn't do anything with order of the intervals -- each interval is stored in a separate array under the workspace. Can you share what the contents of your workspace folder look like? Also, what is the error you're seeing from the downstream tool?
Hi, thank you very much for the reply!
The content in the interval with the problem looks like this: chr9$number (a folder) chr10$number (a folder) chr8$number (a folder) chr7$number (a folder) callset.json vidmap.json vcfheader.vcf _tiledb_workspace.tdb
If within each chr, the order is right, I wonder if the problem comes later: When CreateSomaticPanelOfNormals or GenotypeGVCFs reads the output from the GenomicsDBImport, the reading order is wrong? Although, I could not manually manage the order.
The error I got is that the output vcf file from CreateSomaticPanelOfNormals or GenotypeGVCF has chr 10 ahead of chr 7. (CreateSomaticPanelOfNormals or GenotypeGVCF reads the GenomicsDBImport outputs)
Can you look at the order of chromosomes/contigs in the vcfheader.vcf file? Are those in the correct order?
Thank you for replying!
The contigs in vcfheader.vcf were not subset for only chr 7-10. It contains the information for all chromosomes and is not always in the right order. (ID=chr1, ID=chr2 ... in the right order; ID=chr1_KI270706v1_random et al. are not in the right order.) The vidmap.json file in that folder has the right chr1-22 chromosome order.
But the thing is, for other sub-intervals in the runs. For example, chr 3-5 was in one sub-interval run, and chr 12-15 was in another run but their output is in the right order. Their vcfheader.vcf files also contain the information for all chromosomes and are not all in the right order.
So, as you can see, the problem lies when the single-digit chromosomes and double-digit chromosomes are mixed in one sub-interval. Because, numerically, single-digit chromosomes should be before 10, but characteristically, 10 should be ahead of 7, 8, 9, et al. Either the order is messed up in GenomicsDBImport or is mixed in the following CreateSomaticPanelOfNormals or GenotypeGVCFs steps.
It contains the information for all chromosomes and is not always in the right order. (ID=chr1, ID=chr2 ... in the right order; ID=chr1_KI270706v1_random et al. are not in the right order.)
Just to confirm that chr7-10 show up in the right order in vcfheader.vcf, then? GenomicsDBImport
doesn't store anything else that could impact chromosome/contig order AFAIK (order in the vidmap.json
doesn't matter). Can you share what your GenotypeGVCF
/CreateSomaticPanelOfNormals
commandline is and specifically the stacktrace/error?
The command lines:
gatk --java-options "-Xmx16G" GenotypeGVCFs --reference {genome} --variant gendb://workdir/db_{idx} --output output.vcf.gz
gatk --java-options "-Xmx16G" CreateSomaticPanelOfNormals --reference {genome} --germline-resource {afsource} --variant gendb://workdir/pon_db_{idx} --output pon.vcf.gz
The output vcf files from the above two steps have chromosome 10 before 7. GenomicsDBImport, GenotypeGVCF, or CreateSomaticPanelOfNormals did not give any error messages.