Unusual CPU Load Spikes for SplitNCigar

Open von1laughing opened this issue 3 years ago • 1 comments

Bug Report

Affected tool(s) or class(es)

gatk SplitNCigarReads

Affected version(s)

  • gatk 4.2.6.1

Description

I produced the BAM files with STAR, setting the MQ value for uniquely mapped reads to 60 (--outSAMmapqUnique 60). I then marked duplicates with sambamba markdup and ran SplitNCigarReads on the result.

The CPU load for SplitNCigarReads was very high and at times spiked to 2400%. I tried limiting CPU usage with JVM options such as -XX:ParallelGCThreads=1 and -XX:ConcGCThreads=1, but they don't seem to have any effect. (CPU usage does sometimes stay at 100%.) I also adjusted the MQ value in STAR to lessen the load in SplitNCigarReads, and I tried increasing the read size to reduce I/O time.

Steps to reproduce

STAR

STAR \
--genomeDir ${star_reference_path} \
--runThreadN 16 \
--readFilesIn ${file_1} ${file_2} \
--readFilesCommand "gunzip -c" \
--sjdbOverhang 149 \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 16 \
--outSAMmultNmax 1 \
--outSAMmapqUnique 60 \
--outSAMattrRGline ID:${id} LB:RNASEQ SM:${sample_name} PL:ILLUMINA PU:${platform_unit} PM:${instrument_id} \
--limitBAMsortRAM 50000000000 \
--twopassMode Basic \
--outFileNamePrefix /rawdata/rnaseq/clean/bam/1.

Mark Duplicate

sambamba markdup \
-t 4 \
--tmpdir=/tmp \
--hash-table-size=262144 \
--overflow-list-size=67108864 \
/rawdata/rnaseq/clean/bam/1.Aligned.sortedByCoord.out.bam \
/rawdata/rnaseq/clean/bam/1.aligned.duplicate_marked.sorted.bam

SplitNCigarReads

gatk --java-options "-Djava.io.tmpdir=/tmp -Xmx20G -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1" SplitNCigarReads \
-R ${reference_path} \
--tmp-dir /tmp \
-I /rawdata/rnaseq/clean/bam/1.aligned.duplicate_marked.sorted.bam \
-O /rawdata/rnaseq/clean/bam_gatk/1.aligned.duplicate_marked.sorted.bam \
--create-output-bam-md5 TRUE \
--max-reads-in-memory 1000000 \
--skip-mapping-quality-transform TRUE

von1laughing, Jun 23 '22 04:06

@von1laughing Can you try running jstack on the running GATK process when the CPU usage is ~2400%, and paste the output here? This will produce a dump of the Java threads. You'll need to provide jstack with the process ID (pid) of the running Java process.
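A minimal sketch of how that thread dump could be captured, assuming a JDK with jps and jstack is on the PATH and that the tool name appears on the JVM's command line. The helper names find_jvm_pid and dump_gatk_threads and the output file name are made up for illustration; the per-thread top call assumes Linux procps.

```shell
# Print the PID of the first JVM whose command line mentions the given tool.
# Expects `jps -lm`-style lines on stdin: "<pid> <main class or jar> <args...>".
find_jvm_pid() {
  awk -v tool="$1" 'index($0, tool) { print $1; exit }'
}

# Capture a thread dump and a per-thread CPU snapshot for the running tool.
# Defined here but not called; run it while CPU usage is at its peak.
dump_gatk_threads() {
  pid=$(jps -lm | find_jvm_pid "${1:-SplitNCigarReads}")
  [ -n "$pid" ] || { echo "no matching JVM found" >&2; return 1; }
  jstack "$pid" > "jstack.${pid}.txt"        # Java thread dump
  top -H -b -n 1 -p "$pid" | head -n 40      # busiest native threads (Linux)
}
```

Running dump_gatk_threads with no arguments while the load is near 2400% writes jstack.<pid>.txt to the current directory. jps only lists JVMs, so it should skip the gatk launcher script, which also has the tool name on its command line.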

droazen, Jul 05 '22 19:07