gatk icon indicating copy to clipboard operation
gatk copied to clipboard

GenomicsDBImport sorting issue

Open EsperanzaDai opened this issue 1 year ago • 11 comments

Bug Report

Affected tool(s) or class(es)

GenomicsDBImport

Affected version(s)

Latest

Description

Output from GenomicsDBImport is sorting based on alphabetic order: chr 10 ahead of chr 7

Steps to reproduce

gatk --java-options "-Xmx16G -Xms8G" GenomicsDBImport --genomicsdb-workspace-path /workdir/db_{uid} --reference {genome} --intervals {subint} --sample-name-map sample_map --max-num-intervals-to-import-in-parallel 4 --merge-input-intervals --tmp-dir /tmpdir

Expected behavior

Chr 7 ahead of Chr10. (7 should be before 10)

Actual behavior

Chr 10 ahead of Chr 7. (1 is before 7)

EsperanzaDai avatar Jul 13 '23 16:07 EsperanzaDai

@EsperanzaDai How are the intervals you're passing to GenomicsDBImport sorted? What tool are you using to read from the GenomicsDB and produce a VCF?

droazen avatar Jul 13 '23 19:07 droazen

The intervals we are passing to GenomicsDBImport are sorted by chromosomal numbers for a subset of chromosomal regions: chr 7, 8, 9, 10.
The next step is CreateSomaticPanelOfNormals or GenotypeGVCFs depending on the pipelines. But the output for both steps all shows the problem that chr 10 is ahead of chr 7 (chr 10, 7, 8, and 9). Thanks!

EsperanzaDai avatar Jul 14 '23 14:07 EsperanzaDai

Revision: Affected version(s) 4.3.0.0

EsperanzaDai avatar Jul 19 '23 15:07 EsperanzaDai

Re-tested with the latest version, the bug remained: Affected version(s) 4.4.0.0

EsperanzaDai avatar Jul 19 '23 15:07 EsperanzaDai

@EsperanzaDai Can you try running SelectVariants on your GenomicsDB, and check whether you still get the contigs out-of-order.

@nalinigans / @mlathara , have either of you seen reports of GenomicsDB returning records in lexicographical order instead of sequence dictionary order?

droazen avatar Jul 24 '23 19:07 droazen

GenomicsDBImport itself doesn't do anything with order of the intervals -- each interval is stored in a separate array under the workspace. Can you share what the contents of your workspace folder look like? Also, what is the error you're seeing from the downstream tool?

mlathara avatar Jul 24 '23 22:07 mlathara

Hi, thank you very much for the reply!

The content in the interval with the problem looks like this: chr9$number (a folder) chr10$number (a folder) chr8$number (a folder) chr7$number (a folder) callset.json vidmap.json vcfheader.vcf _tiledb_workspace.tdb

If within each chr, the order is right, I wonder if the problem comes later: When CreateSomaticPanelOfNormals or GenotypeGVCFs reads the output from the GenomicsDBImport, the reading order is wrong? Although, I could not manually manage the order.

The error I got is that the output vcf file from CreateSomaticPanelOfNormals or GenotypeGVCF has chr 10 ahead of chr 7. (CreateSomaticPanelOfNormals or GenotypeGVCF reads the GenomicsDBImport outputs)

EsperanzaDai avatar Jul 26 '23 16:07 EsperanzaDai

Can you look at the order of chromosomes/contigs in the vcfheader.vcf file? Are those in the correct order?

mlathara avatar Jul 27 '23 16:07 mlathara

Thank you for replying!

The contigs in vcfheader.vcf were not subset for only chr 7-10. It contains the information for all chromosomes and is not always in the right order. (ID=chr1, ID=chr2 ... in the right order; ID=chr1_KI270706v1_random et al. are not in the right order.) The vidmap.json file in that folder has the right chr1-22 chromosome order.

But the thing is, for other sub-intervals in the runs. For example, chr 3-5 was in one sub-interval run, and chr 12-15 was in another run but their output is in the right order. Their vcfheader.vcf files also contain the information for all chromosomes and are not all in the right order.

So, as you can see, the problem lies when the single-digit chromosomes and double-digit chromosomes are mixed in one sub-interval. Because, numerically, single-digit chromosomes should be before 10, but characteristically, 10 should be ahead of 7, 8, 9, et al. Either the order is messed up in GenomicsDBImport or is mixed in the following CreateSomaticPanelOfNormals or GenotypeGVCFs steps.

EsperanzaDai avatar Jul 31 '23 21:07 EsperanzaDai

It contains the information for all chromosomes and is not always in the right order. (ID=chr1, ID=chr2 ... in the right order; ID=chr1_KI270706v1_random et al. are not in the right order.)

Just to confirm that chr7-10 show up in the right order in vcfheader.vcf, then? GenomicsDBImport doesn't store anything else that could impact chromosome/contig order AFAIK (order in the vidmap.json doesn't matter). Can you share what your GenotypeGVCF/CreateSomaticPanelOfNormals commandline is and specifically the stacktrace/error?

mlathara avatar Jul 31 '23 22:07 mlathara

The command lines:

gatk --java-options "-Xmx16G" GenotypeGVCFs --reference {genome} --variant gendb://workdir/db_{idx} --output output.vcf.gz

gatk --java-options "-Xmx16G" CreateSomaticPanelOfNormals --reference {genome} --germline-resource {afsource} --variant gendb://workdir/pon_db_{idx} --output pon.vcf.gz

The output vcf files from the above two steps have chromosome 10 before 7. GenomicsDBImport, GenotypeGVCF, or CreateSomaticPanelOfNormals did not give any error messages.

EsperanzaDai avatar Aug 01 '23 14:08 EsperanzaDai