GenotypeGVCFs reports java.lang.OutOfMemoryError: Java heap space when calling from an incrementally imported GenomicsDB

Open LYOKOIIIYYR opened this issue 1 year ago • 3 comments

Bug Report

Affected tool(s) or class(es)

gatk GenomicsDBImport GenotypeGVCFs

Affected version(s)

The Genome Analysis Toolkit (GATK) v4.5.0.0

Description

Hi, here is my situation: I'm testing the feasibility of an incremental GenomicsDB. I have 400 samples in total for joint calling. I have no problem running GenomicsDBImport and GenotypeGVCFs directly on all 400 samples; the configuration used is 4c32g for GenomicsDBImport and 2c16g for GenotypeGVCFs. However, when I first build a GenomicsDB of 200 samples with GenomicsDBImport (which succeeds), then add the remaining 200 samples with GenomicsDBImport --genomicsdb-update-workspace-path, and finally run GenotypeGVCFs on this incrementally imported GenomicsDB, the error occurs and the log reports: GENOMICSDB_TIMER, Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. Here is my code:

gatk --java-options "-Xms8000m -Xmx~{max_mem}m" \
            GenomicsDBImport \
            --tmp-dir $PWD \
            --genomicsdb-workspace-path ~{workspace_dir_name}~{prefix}.~{index} \
            --batch-size 50 \
            -L ~{intervals} \
            --reader-threads 5 \
            --merge-input-intervals \
            --consolidate \
            -V ~{sep = " -V " single_sample_gvcfs}

gatk --java-options "-Xms8000m -Xmx~{max_mem}m" \
            GenomicsDBImport \
            --tmp-dir $PWD \
            --genomicsdb-update-workspace-path ~{workspace_dir_name} \
            --batch-size 50 \
            --reader-threads 5 \
            --merge-input-intervals \
            --consolidate \
            -V ~{sep = " -V " single_sample_gvcfs}

gatk --java-options "-Xms8000m -Xmx~{max_mem}m" \
            GenotypeGVCFs \
            --tmp-dir $PWD \
            -R ~{ref} \
            -O ~{workspace_dir_name}.vcf.gz \
            -G StandardAnnotation \
            --only-output-calls-starting-in-intervals \
            -V gendb://~{workspace_dir_name} \
            -L ~{intervals} \
            --merge-input-intervals \
            -all-sites

I also found that, just before the error was reported, the number of threads used by GATK increased, but memory usage did not exceed the server's maximum limit. I also changed --max-alternate-alleles and --genomicsdb-max-alternate-alleles to smaller values, but still got the same error.

I would appreciate some insight into why that is.

Thanks, Yang

LYOKOIIIYYR, Apr 16 '24 08:04

Hi @LYOKOIIIYYR, you seem to have set your heap size to the maximum amount of memory you have, which we do not recommend. GenotypeGVCFs does not need that much memory, if I recall correctly. Can you set the heap size to a more moderate value, such as 8 GB or 12 GB, and try again?

gokalpcelik, Apr 16 '24 15:04

Yes, it's important to realize that GenomicsDB is implemented in C++ (not Java), so the memory available to GenomicsDB is whatever is NOT allocated to Java (i.e., whatever is left over after -Xmx). So -Xmx should never claim all of the memory on the machine, and should leave enough free memory for GenomicsDB to use.
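
The headroom rule above can be sketched as a small shell helper that derives the -Xmx value from the memory available to the task. This is only an illustration: the function name java_heap_mb and the example figures (16384 MB total, 6144 MB reserved for native code) are assumptions, not recommended values, and the right reserve will depend on the workspace and intervals.

```shell
# Hypothetical helper: pick a JVM heap size (MB) that leaves native headroom.
#   $1 = total memory available to the task, in MB
#   $2 = memory to leave outside the Java heap for GenomicsDB, in MB (assumed, tune it)
java_heap_mb() {
    total_mb=$1
    reserve_mb=$2
    heap_mb=$((total_mb - reserve_mb))
    # Refuse configurations that would leave no room for the Java heap.
    if [ "$heap_mb" -le 0 ]; then
        echo "reserve (${reserve_mb} MB) must be smaller than total (${total_mb} MB)" >&2
        return 1
    fi
    echo "$heap_mb"
}

# Example with assumed figures: a 16 GB task that keeps 6 GB outside the heap.
heap=$(java_heap_mb 16384 6144)
echo "-Xmx${heap}m"
```

The printed value could then be passed through --java-options in place of a heap size equal to the full machine memory.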

droazen, Apr 16 '24 17:04

There is no problem running GenomicsDBImport, and @gokalpcelik, I have already tried Xmx10G to Xmx14G and get the same error. I'm most curious about why running GenomicsDBImport and GenotypeGVCFs directly with 400 samples succeeds on the same computational resources, while running GenotypeGVCFs on an incrementally imported GenomicsDB with 200 + 200 samples fails.

LYOKOIIIYYR, Apr 17 '24 03:04