gatk
GenomicsDB malloc unaligned tcache chunk error
Bug Report
Affected tool(s) or class(es)
- Tool/class name(s), special parameters: GenomicsDBImport
Affected version(s)
- Version: gatk4-4.4.0.0-0
Description
Hello,
I have been running into an issue with GenomicsDBImport. It has occurred across a range of sample and shard counts (8 to 1000 samples, shard counts of up to 2000). My current example is an attempt to joint call 1000 samples together. When I submit the jobs, 1-2 of the ~100 concurrently running shards throw malloc(): unaligned tcache chunk detected. When I resubmit such a shard, it usually reruns without a problem. If I instead kill all jobs and resubmit, a different shard throws the malloc error.
I have run approximately 20 tests and I seem to get this failure 2/3 times. However, it only arises on the initial submission and not when additional jobs are submitted as previous shards complete. Please note that the 1000 samples have successfully been imported into the GenomicsDB but this error seems to persist somewhat randomly across multiple machines.
Thank you for your assistance!
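Since a failed shard usually succeeds when it is simply resubmitted, the resubmission can be automated as a stopgap while the root cause is investigated. The following is a minimal sketch, not a fix: `run_shard_with_retry`, its `runner` parameter, and the `cleanup` hook are hypothetical names introduced here, and it assumes the glibc message ends up in the captured output of the job, as it does in the log in this report. A partially written workspace may need to be removed before a rerun, which is what the `cleanup` hook is meant for.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Message printed by glibc in the failing shards (copied from this report).
TCACHE_ERROR = "malloc(): unaligned tcache chunk detected"


def run_shard_with_retry(
    cmd: Sequence[str],
    max_attempts: int = 3,
    runner: Callable = subprocess.run,
    cleanup: Optional[Callable[[], None]] = None,
) -> int:
    """Run cmd and retry only when the output contains the tcache message.

    Returns the number of the attempt that succeeded. `cleanup` is a
    hypothetical hook, e.g. to delete a partially written workspace.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(list(cmd), capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        output = (getattr(result, "stdout", "") or "") + (result.stderr or "")
        if TCACHE_ERROR not in output:
            # Unrelated failure: surface it instead of hiding it behind a retry.
            raise RuntimeError(f"shard failed (exit {result.returncode}): {output}")
        if cleanup is not None and attempt < max_attempts:
            cleanup()
    raise RuntimeError(f"shard still failing after {max_attempts} attempts")
```

Retrying only on the specific malloc message keeps genuine input or configuration errors visible rather than silently rerunning them.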
Steps to reproduce
- Command used (omitting paths to 1000 samples for brevity) for one of the failed shards.
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx8g -jar /gpfs/gpfs_de6000/home/dalegre/miniconda3/envs/GOASTv4.0/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar GenomicsDBImport -V [samples 1-1002] --genomicsdb-workspace-path results/jointcalling/genomicsDB/temp_0882_of_2000_DB --merge-input-intervals false --bypass-feature-reader --tmp-dir temp --max-num-intervals-to-import-in-parallel 10 --batch-size 50 --intervals results/germline/interval/temp_0882_of_2000/scattered.interval_list --genomicsdb-shared-posixfs-optimizations true
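For readability, the long command above can be assembled from its parts. This is only an illustrative sketch: `build_import_command` and its parameter names are hypothetical, the `-Dsamjdk.*` JVM properties are left out for brevity, and the only GATK options used are the ones that appear in the command above. It assumes each sample is passed with its own `-V` argument.

```python
from typing import List, Sequence


def build_import_command(
    jar: str,
    sample_vcfs: Sequence[str],
    workspace: str,
    intervals: str,
    tmp_dir: str = "temp",
    batch_size: int = 50,
    parallel_intervals: int = 10,
    heap: str = "8g",
) -> List[str]:
    """Return the argument list for one GenomicsDBImport shard."""
    cmd = ["java", f"-Xmx{heap}", "-jar", jar, "GenomicsDBImport"]
    # One -V per input file.
    for vcf in sample_vcfs:
        cmd += ["-V", vcf]
    cmd += [
        "--genomicsdb-workspace-path", workspace,
        "--merge-input-intervals", "false",
        "--bypass-feature-reader",
        "--tmp-dir", tmp_dir,
        "--max-num-intervals-to-import-in-parallel", str(parallel_intervals),
        "--batch-size", str(batch_size),
        "--intervals", intervals,
        "--genomicsdb-shared-posixfs-optimizations", "true",
    ]
    return cmd
```

Passing the result as a list to a process launcher avoids shell quoting issues with long sample paths.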
Expected behavior
All shards are imported into the GenomicsDB successfully.
Actual behavior
job dies with this error:
malloc(): unaligned tcache chunk detected
23:45:26.793 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/gpfs_de6000/home/dalegre/miniconda3/envs/GOASTv4.0/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
23:45:26.822 INFO GenomicsDBImport - ------------------------------------------------------------
23:45:26.824 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.4.0.0
23:45:26.824 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
23:45:26.824 INFO GenomicsDBImport - Executing as [email protected] on Linux v5.14.0-284.11.1.el9_2.x86_64 amd64
23:45:26.824 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src
23:45:26.824 INFO GenomicsDBImport - Start Date/Time: February 6, 2024 at 11:45:26 PM CET
23:45:26.824 INFO GenomicsDBImport - ------------------------------------------------------------
23:45:26.824 INFO GenomicsDBImport - ------------------------------------------------------------
23:45:26.825 INFO GenomicsDBImport - HTSJDK Version: 3.0.5
23:45:26.825 INFO GenomicsDBImport - Picard Version: 3.0.0
23:45:26.825 INFO GenomicsDBImport - Built for Spark Version: 3.3.1
23:45:26.826 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:45:26.826 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:45:26.826 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:45:26.826 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:45:26.826 INFO GenomicsDBImport - Deflater: IntelDeflater
23:45:26.827 INFO GenomicsDBImport - Inflater: IntelInflater
23:45:26.827 INFO GenomicsDBImport - GCS max retries/reopens: 20
23:45:26.827 INFO GenomicsDBImport - Requester pays: disabled
23:45:26.827 INFO GenomicsDBImport - Initializing engine
23:45:46.550 INFO FeatureManager - Using codec IntervalListCodec to read file file:///gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.4/results/germline/interval/temp_0882_of_2000/scattered.interval_list
23:45:46.584 INFO IntervalArgumentCollection - Processing 1086188 bp from intervals
23:45:46.586 INFO GenomicsDBImport - Done initializing engine
23:45:47.489 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.4-ce4e1b9
23:45:47.491 INFO GenomicsDBImport - Vid Map JSON file will be written to /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.4/results/jointcalling/genomicsDB/temp_0882_of_2000_DB/vidmap.json
23:45:47.491 INFO GenomicsDBImport - Callset Map JSON file will be written to /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.4/results/jointcalling/genomicsDB/temp_0882_of_2000_DB/callset.json
23:45:47.491 INFO GenomicsDBImport - Complete VCF Header will be written to /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.4/results/jointcalling/genomicsDB/temp_0882_of_2000_DB/vcfheader.vcf
23:45:47.491 INFO GenomicsDBImport - Importing to workspace - /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.4/results/jointcalling/genomicsDB/temp_0882_of_2000_DB
malloc(): unaligned tcache chunk detected
@nalinigans Any thoughts on this?
It almost looks like there is a buffer overrun somewhere. Most of our testing has been on nfs, and we have not encountered a tcache (thread-local cache) issue there. Is gpfs available as open source?
If it helps, I have seen this error when using local drives exclusively (not attached to a shared file system).
Twice it has manifested as a core dump that points to C [libc.so.6+0xaf4f9] malloc+0x169:
[dalegre@login4601 fdone]$ head -n 20 hs_err_pid1182729.log
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x000014cfb1d504f9, pid=1182729, tid=1195264
#
# JRE version: OpenJDK Runtime Environment (17.0.3) (build 17.0.3-internal+0-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (17.0.3-internal+0-adhoc..src, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libc.so.6+0xaf4f9] malloc+0x169
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.3/core.1182729)
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
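When many shards run concurrently, it can help to pull the signal, process ID, and problematic frame out of each hs_err file automatically so that crashes can be grouped. The sketch below is based only on the header lines shown above. `parse_hs_err_header` and the regular expressions are illustrative names, not part of GATK or the JDK, and the patterns assume the header layout seen in this report.

```python
import re
from typing import Dict, Optional

# Matches e.g. "# SIGSEGV (0xb) at pc=0x..., pid=1182729, tid=1195264"
SIGNAL_RE = re.compile(
    r"^#\s+(?P<signal>SIG[A-Z]+) \((?P<code>0x[0-9a-f]+)\) at pc=(?P<pc>0x[0-9a-f]+), "
    r"pid=(?P<pid>\d+), tid=(?P<tid>\d+)",
    re.MULTILINE,
)
# Matches the line after "# Problematic frame:", e.g.
# "# C [libc.so.6+0xaf4f9] malloc+0x169"
FRAME_RE = re.compile(
    r"^# Problematic frame:\s*\n"
    r"#\s+(?P<kind>\S+)\s+\[(?P<library>.+?)\+(?P<offset>0x[0-9a-f]+)\]"
    r"(?:\s+(?P<symbol>\S.*))?$",
    re.MULTILINE,
)


def parse_hs_err_header(text: str) -> Optional[Dict[str, str]]:
    """Extract signal and problematic-frame fields, or None if not found."""
    sig = SIGNAL_RE.search(text)
    frame = FRAME_RE.search(text)
    if sig is None or frame is None:
        return None
    info = dict(sig.groupdict())
    # The symbol is optional in the pattern, so normalise a missing one to "".
    info.update({k: (v or "").strip() for k, v in frame.groupdict().items()})
    return info
```

For the header above, this yields the signal `SIGSEGV`, the library `libc.so.6`, and the symbol `malloc+0x169`, which is consistent with a crash in native code inside the allocator.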
@danagibbon thanks for this pointer. What versions of gatk have you seen this error on?
@nalinigans thank you for the prompt replies! I'm using gatk4-4.4.0.0-0. I will try the latest version next week when our cluster is back online (it is currently undergoing scheduled maintenance).
Thanks @danagibbon, I may know what the issue is. hdfs support in GenomicsDB still relies on JVM/Java 11, and we put in some workarounds involving thread-local caches a while ago. I will create a branch without hdfs sometime next week, which will hopefully get us past this issue.
Thank you, much appreciated!!! Have a nice weekend.
@danagibbon, here is the branch - https://github.com/broadinstitute/gatk/tree/ng_remove_hdfs_support. Can you build gatk from this branch and try it out, please? If the problem still exists, can you attach the core dump file too? Thanks.