sarek
sarek copied to clipboard
CalculateContamination table needs to be sorted
Hi,
When I run following pipeline:
nextflow run /public/slst/home/wutao2/nf-core-sarek-3.1.1/workflow/ --input /public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/Non_PDX/non_pdx_kapa.csv --outdir /public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/Non_PDX/output --genome GATK.GRCh38 -profile singularity --wes --tools mutect2,merge --only_paired_variant_calling --max_cpus 36 --max_memory '128.GB' --save_mapped --save_output_as_bam --igenomes_base /public/slst/home/wutao2/aws-igenomes/igenomes --intervals /public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/KAPA_target.bed
It shows CalculateContamination Error
:
-[nf-core/sarek] Pipeline completed with errors-
Error executing process > 'NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_MUTECT2:CALCULATECONTAMINATION (H208323_vs_H208344)'
Caused by:
Process `NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_MUTECT2:CALCULATECONTAMINATION (H208323_vs_H208344)` terminated with an error exit status (3)
Command executed:
gatk --java-options "-Xmx12g" CalculateContamination \
--input H208323.mutect2.pileupsummaries.table \
--output H208323_vs_H208344.mutect2.contamination.table \
--matched-normal H208344.mutect2.pileupsummaries.table \
--tmp-dir . \
-tumor-segmentation H208323_vs_H208344.mutect2.segmentation.table
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_MUTECT2:CALCULATECONTAMINATION":
gatk4: $(echo $(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*$//')
END_VERSIONS
Command exit status:
3
Command output:
(empty)
Command error:
14:45:39.575 INFO CalculateContamination - Shutting down engine
[December 19, 2022 at 2:45:39 PM GMT] org.broadinstitute.hellbender.tools.walkers.cr of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (11). Local changepoint costs will not be calculated for this window size.
14:45:38.169 WARN KernelSegmenter - No changepoint candidates were found. The specified window sizes may be inappropriate, or there may be insufficient data points.
14:45:38.170 INFO KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
14:45:38.533 INFO KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
14:45:38.659 INFO KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
14:45:38.811 INFO KernelSegmenter - Found 2 changepoints after applying the changepoint penalty.
14:45:38.840 WARN KernelSegmenter - Specified dimension of the kernel approximation (100) exceeds the number of data points (91) to segment; using all data points to calculate kernel matrix.
14:45:39.008 WARN KernelSegmenter - Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (91). Local changepoint costs will not be calculated for this window size.
14:45:39.009 WARN KernelSegmenter - No changepoint candidates were found. The specified window sizes may be inappropriate, or there may be insufficient data points.
14:45:39.010 INFO KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
14:45:39.117 INFO KernelSegmenter - Found 2 changepoints after applying the changepoint penalty.
14:45:39.239 INFO KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
14:45:39.367 INFO KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
14:45:39.479 INFO KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
14:45:39.574 INFO KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
14:45:39.575 INFO CalculateContamination - Shutting down engine
[December 19, 2022 at 2:45:39 PM GMT] org.broadinstitute.hellbender.tools.walkers.contamination.CalculateContamination done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=2147483648
java.lang.IllegalArgumentException: Invalid interval. Contig:chr22 start:41093206 end:40819179
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:804)
at org.broadinstitute.hellbender.utils.SimpleInterval.validatePositions(SimpleInterval.java:59)
at org.broadinstitute.hellbender.utils.SimpleInterval.<init>(SimpleInterval.java:35)
at org.broadinstitute.hellbender.tools.walkers.contamination.ContaminationSegmenter.lambda$findContigSegments$4(ContaminationSegmenter.java:71)
at java.base/java.util.stream.IntPipeline$1$1.accept(IntPipeline.java:180)
at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
at java.base/java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:699)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
at org.broadinstitute.hellbender.tools.walkers.contamination.ContaminationSegmenter.findContigSegments(ContaminationSegmenter.java:72)
at org.broadinstitute.hellbender.tools.walkers.contamination.ContaminationSegmenter.lambda$findSegments$1(ContaminationSegmenter.java:43)
at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:271)
at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1693)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
at org.broadinstitute.hellbender.tools.walkers.contamination.ContaminationSegmenter.findSegments(ContaminationSegmenter.java:45)
at org.broadinstitute.hellbender.tools.walkers.contamination.ContaminationModel.<init>(ContaminationModel.java:61)
at org.broadinstitute.hellbender.tools.walkers.contamination.CalculateContamination.doWork(CalculateContamination.java:124)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Work dir:
/slst/home/wutao2/work/e8/82be9a8b520189393b3406eb9dc6f3
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
The whole output log can be found in here : run_netflow_log.zip.
But when I checked the dir /slst/home/wutao2/work/e8/82be9a8b520189393b3406eb9dc6f3
:
(base) -bash-4.2$ ls -a
. .command.begin .command.log .command.run .command.trace H208323.mutect2.pileupsummaries.table
.. .command.err .command.out .command.sh .exitcode H208344.mutect2.pileupsummaries.table
The .command.err
file shows some warnings but no errors, the files in this dir can be found in here : command.zip.
How could I fix this problem ? Thank you.
Tao Wu
The error message is java.lang.IllegalArgumentException: Invalid interval. Contig:chr22 start:41093206 end:40819179
, but I have no idea how to fix it. There are two pileupsummaries
files : mutect2.pileupsummaries.zip
Hi Tao, I'm not part of the Sarek team but I encountered the same error as you and I think I found the issue. It seems like the pileup summary table inputted is not sorted, so when CalculateContamination traverses the file it tries to make invalid intervals:
$ awk '{print $1}' A5.mutect2.pileupsummaries.table | uniq
#<METADATA>SAMPLE=A5_A5
contig
chr1
chr10
chr2
chr19
chr2
chr20
chr3
chr22
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrX
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
the structure is the same as the order of files in the directory for the GatherPileupSummaries step, so it seems like the code just cats the pileups and doesn't automatically sort by variants. I was able to get the code to work by sorting the table manually:
$ head -n 2 A5.mutect2.pileupsummaries.table > sorted.table
tail -n +3 A5.mutect2.pileupsummaries.table | sort -V -k1 -k2 >> sorted.table
I then replaced the original pileup table in .command.sh
with sorted.table
, this time the code succeeded:
$ bash .command.run
[January 20, 2023 at 9:35:28 PM UTC] org.broadinstitute.hellbender.tools.walkers.contamination.CalculateContamination done. Elapsed time: 0.20 minutes.
Runtime.totalMemory()=1140850688
Tool returned:
SUCCESS
$ cat .exitcode
0
As a temporary solution, I'm just changing the source code of Sarek to sort the table before running CalculateContamination so that I don't have to manually do it for every new sample. Hopefully there will be a fix in either Sarek or GATK coming up.
Best, Evan Wu
@evanwu1119, if you like you could also submit this fix via a PR :) Otherwise we try to get around to it as soon as possible
A quick temporary fix would be to do the sorting (as suggested by @evanwu1119) in the process defined by the module (gatk4/calculatecontamination
). So change the first part of the script in modules/nf-core/gatk4/calculatecontamination/main.nf
to:
head -n 2 $pileup > pileup.sorted.table
tail -n +3 $pileup | sort -V -k1 -k2 >> pileup.sorted.table
head -n 2 $matched > matched.sorted.table
tail -n +3 $matched | sort -V -k1 -k2 >> matched.sorted.table
gatk --java-options "-Xmx${avail_mem}g" CalculateContamination \\
--input pileup.sorted.table \\
--output ${prefix}.contamination.table \\
--matched-normal matched.sorted.table \\
--tmp-dir . \\
$args
But this assumes the --matched-normal
option is always used.
Would love to submit a fix in a PR @FriederikeHanssen, but I don't think I'm experienced enough to make the right choices here.
My data is for tumor only so I didn't deal with the matched normals table. I think it's probably best if you guys solve it more systematically, sorry! Probably it would be best to sort the tables in the GatherPileupSummaries module so that you don't have to deal with the conditional of whether there is a matched table or not in CalculateContamination. It's also the step that introduces the error anyways, but I'm not sure if there are any differences between the normal/tumor module or if there needs to be workflow changes.
Adding this to the milestone for 3.2
I used the most up-to-date version of GATK (https://hub.docker.com/r/broadinstitute/gatk-nightly/) to run the following workflow (input is the recalibration bam file saved by sarek), and the pileup summary table was sorted correct.
That's good to know, hopefully that'll be fix at the next GATK release then
I ran into the same issue when running a few tumor-only samples, using --tools mutect2,strelka
. However, I realized that my interval BED file used was not sorted (i.e. chromosomal order), and when I fixed this part, the workflow succeeded without any error. Not completely sure whether this is the reason or underlying nature of the issue, but thought it might be worth mentioning.
Hi @sigven ! Thanks for sharing. I guess, there could be multiple issues that result in the same error? We are also looking into "validating" the bed file beforehand. I just need to find a good tool. so hopefully in the future some of this can be caught early on
I ran into the same issue when running a few tumor-only samples, using
--tools mutect2,strelka
. However, I realized that my interval BED file used was not sorted (i.e. chromosomal order), and when I fixed this part, the workflow succeeded without any error. Not completely sure whether this is the reason or underlying nature of the issue, but thought it might be worth mentioning.
This fixed it for me also.
cat intervals.bed | sort -k1,1V -k2,2n > intervals_sorted.bed
for 'chr'-prefixed chromosome names
@wt12318 c , Could you please explain how did you use and how did you enter back sarek pipeline? Thanks
I have multiple of the same errors. Is there a way to solve it other than manually sorting the bed files?
I have multiple of the same errors. Is there a way to solve it other than manually sorting the bed files?
unfortunately not at the moment, but we're thinking of a solution
I failed at reproducing the error. I ran
gatk --java-options "-Xmx5324M" CalculateContamination \
--input sample2.mutect2.pileups.table \
--output sample2.mutect2.contamination.table \
--tmp-dir . \
-tumor-segmentation sample2.mutect2.segmentation.table
with sample2.mutect2.pileups.table
unsorted like this:
#<METADATA>SAMPLE=tumour
contig position ref_count alt_count other_alt_count allele_frequency
chr21 10569534 39 0 1 0.0777421
chr1 10559797 22 0 0 0.0836978
chr21 10605716 32 0 1 0.180505
...
No problem running the GATK-CalculateContamination cmd using GATK v4.2.6.1, v4.3 or v4.4.