sarek icon indicating copy to clipboard operation
sarek copied to clipboard

CalculateContamination table needs to be sorted

Open wt12318 opened this issue 1 year ago • 15 comments

Hi,

When I run following pipeline:

nextflow run /public/slst/home/wutao2/nf-core-sarek-3.1.1/workflow/ --input /public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/Non_PDX/non_pdx_kapa.csv --outdir /public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/Non_PDX/output --genome GATK.GRCh38 -profile singularity --wes --tools mutect2,merge --only_paired_variant_calling --max_cpus 36 --max_memory '128.GB' --save_mapped --save_output_as_bam --igenomes_base /public/slst/home/wutao2/aws-igenomes/igenomes --intervals /public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/KAPA_target.bed 

It shows CalculateContamination Error :

-[nf-core/sarek] Pipeline completed with errors-
Error executing process > 'NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_MUTECT2:CALCULATECONTAMINATION (H208323_vs_H208344)'

Caused by:
  Process `NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_MUTECT2:CALCULATECONTAMINATION (H208323_vs_H208344)` terminated with an error exit status (3)

Command executed:

  gatk --java-options "-Xmx12g" CalculateContamination \
      --input H208323.mutect2.pileupsummaries.table \
      --output H208323_vs_H208344.mutect2.contamination.table \
      --matched-normal H208344.mutect2.pileupsummaries.table \
      --tmp-dir . \
      -tumor-segmentation H208323_vs_H208344.mutect2.segmentation.table
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_MUTECT2:CALCULATECONTAMINATION":
      gatk4: $(echo $(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*$//')
  END_VERSIONS

Command exit status:
  3

Command output:
  (empty)

Command error:
  14:45:39.575 INFO  CalculateContamination - Shutting down engine
  [December 19, 2022 at 2:45:39 PM GMT] org.broadinstitute.hellbender.tools.walkers.cr of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (11).  Local changepoint costs will not be calculated for this window size.
  14:45:38.169 WARN  KernelSegmenter - No changepoint candidates were found.  The specified window sizes may be inappropriate, or there may be insufficient data points.
  14:45:38.170 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
  14:45:38.533 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
  14:45:38.659 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
  14:45:38.811 INFO  KernelSegmenter - Found 2 changepoints after applying the changepoint penalty.
  14:45:38.840 WARN  KernelSegmenter - Specified dimension of the kernel approximation (100) exceeds the number of data points (91) to segment; using all data points to calculate kernel matrix.
  14:45:39.008 WARN  KernelSegmenter - Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (91).  Local changepoint costs will not be calculated for this window size.
  14:45:39.009 WARN  KernelSegmenter - No changepoint candidates were found.  The specified window sizes may be inappropriate, or there may be insufficient data points.
  14:45:39.010 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
  14:45:39.117 INFO  KernelSegmenter - Found 2 changepoints after applying the changepoint penalty.
  14:45:39.239 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
  14:45:39.367 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
  14:45:39.479 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
  14:45:39.574 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
  14:45:39.575 INFO  CalculateContamination - Shutting down engine
  [December 19, 2022 at 2:45:39 PM GMT] org.broadinstitute.hellbender.tools.walkers.contamination.CalculateContamination done. Elapsed time: 0.07 minutes.
  Runtime.totalMemory()=2147483648
  java.lang.IllegalArgumentException: Invalid interval. Contig:chr22 start:41093206 end:40819179
  	at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:804)
  	at org.broadinstitute.hellbender.utils.SimpleInterval.validatePositions(SimpleInterval.java:59)
  	at org.broadinstitute.hellbender.utils.SimpleInterval.<init>(SimpleInterval.java:35)
  	at org.broadinstitute.hellbender.tools.walkers.contamination.ContaminationSegmenter.lambda$findContigSegments$4(ContaminationSegmenter.java:71)
  	at java.base/java.util.stream.IntPipeline$1$1.accept(IntPipeline.java:180)
  	at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
  	at java.base/java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:699)
  	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
  	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
  	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
  	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
  	at org.broadinstitute.hellbender.tools.walkers.contamination.ContaminationSegmenter.findContigSegments(ContaminationSegmenter.java:72)
  	at org.broadinstitute.hellbender.tools.walkers.contamination.ContaminationSegmenter.lambda$findSegments$1(ContaminationSegmenter.java:43)
  	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:271)
  	at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1693)
  	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
  	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
  	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
  	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
  	at org.broadinstitute.hellbender.tools.walkers.contamination.ContaminationSegmenter.findSegments(ContaminationSegmenter.java:45)
  	at org.broadinstitute.hellbender.tools.walkers.contamination.ContaminationModel.<init>(ContaminationModel.java:61)
  	at org.broadinstitute.hellbender.tools.walkers.contamination.CalculateContamination.doWork(CalculateContamination.java:124)
  	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
  	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
  	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
  	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
  	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
  	at org.broadinstitute.hellbender.Main.main(Main.java:289)

Work dir:
  /slst/home/wutao2/work/e8/82be9a8b520189393b3406eb9dc6f3

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

The whole output log can be found in here : run_netflow_log.zip.

But when I checked the dir /slst/home/wutao2/work/e8/82be9a8b520189393b3406eb9dc6f3:

(base) -bash-4.2$ ls -a
.   .command.begin  .command.log  .command.run  .command.trace  H208323.mutect2.pileupsummaries.table
..  .command.err    .command.out  .command.sh   .exitcode       H208344.mutect2.pileupsummaries.table

The .command.err file shows some warnings but no errors, the files in this dir can be found in here : command.zip.

How could I fix this problem ? Thank you.

Tao Wu

wt12318 avatar Dec 19 '22 23:12 wt12318

The error message is java.lang.IllegalArgumentException: Invalid interval. Contig:chr22 start:41093206 end:40819179, but I have no idea how to fix it. There are two pileupsummaries files : mutect2.pileupsummaries.zip

wt12318 avatar Dec 21 '22 01:12 wt12318

Hi Tao, I'm not part of the Sarek team but I encountered the same error as you and I think I found the issue. It seems like the pileup summary table inputted is not sorted, so when CalculateContamination traverses the file it tries to make invalid intervals:

$ awk '{print $1}' A5.mutect2.pileupsummaries.table | uniq
#<METADATA>SAMPLE=A5_A5
contig
chr1
chr10
chr2
chr19
chr2
chr20
chr3
chr22
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrX
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX

the structure is the same as the order of files in the directory for the GatherPileupSummaries step, so it seems like the code just cats the pileups and doesn't automatically sort by variants. I was able to get the code to work by sorting the table manually:

$ head -n 2 A5.mutect2.pileupsummaries.table > sorted.table 
tail -n +3 A5.mutect2.pileupsummaries.table | sort -V -k1 -k2 >> sorted.table

I then replaced the original pileup table in .command.sh with sorted.table, this time the code succeeded:

$ bash .command.run
[January 20, 2023 at 9:35:28 PM UTC] org.broadinstitute.hellbender.tools.walkers.contamination.CalculateContamination done. Elapsed time: 0.20 minutes.
Runtime.totalMemory()=1140850688
Tool returned:
SUCCESS
$ cat .exitcode
0

As a temporary solution, I'm just changing the source code of Sarek to sort the table before running CalculateContamination so that I don't have to manually do it for every new sample. Hopefully there will be a fix in either Sarek or GATK coming up.

Best, Evan Wu

evanwu1119 avatar Jan 20 '23 21:01 evanwu1119

@evanwu1119, if you like you could also submit this fix via a PR :) Otherwise we try to get around to it as soon as possible

FriederikeHanssen avatar Jan 23 '23 14:01 FriederikeHanssen

A quick temporary fix would be to do the sorting (as suggested by @evanwu1119) in the process defined by the module (gatk4/calculatecontamination). So change the first part of the script in modules/nf-core/gatk4/calculatecontamination/main.nf to:

    head -n 2 $pileup > pileup.sorted.table 
    tail -n +3 $pileup | sort -V -k1 -k2 >> pileup.sorted.table 

    head -n 2 $matched > matched.sorted.table 
    tail -n +3 $matched | sort -V -k1 -k2 >> matched.sorted.table

    gatk --java-options "-Xmx${avail_mem}g" CalculateContamination \\
        --input pileup.sorted.table \\
        --output ${prefix}.contamination.table \\
        --matched-normal matched.sorted.table \\
        --tmp-dir . \\
        $args

But this assumes the --matched-normal option is always used.

Would love to submit a fix in a PR @FriederikeHanssen, but I don't think I'm experienced enough to make the right choices here.

GeertvanGeest avatar Jan 23 '23 16:01 GeertvanGeest

My data is for tumor only so I didn't deal with the matched normals table. I think it's probably best if you guys solve it more systematically, sorry! Probably it would be best to sort the tables in the GatherPileupSummaries module so that you don't have to deal with the conditional of whether there is a matched table or not in CalculateContamination. It's also the step that introduces the error anyways, but I'm not sure if there are any differences between the normal/tumor module or if there needs to be workflow changes.

evanwu1119 avatar Jan 24 '23 18:01 evanwu1119

Adding this to the milestone for 3.2

maxulysse avatar Jan 25 '23 09:01 maxulysse

I used the most up-to-date version of GATK (https://hub.docker.com/r/broadinstitute/gatk-nightly/) to run the following workflow (input is the recalibration bam file saved by sarek), and the pileup summary table was sorted correct.

wt12318 avatar Jan 27 '23 08:01 wt12318

That's good to know, hopefully that'll be fix at the next GATK release then

maxulysse avatar Jan 27 '23 08:01 maxulysse

I ran into the same issue when running a few tumor-only samples, using --tools mutect2,strelka. However, I realized that my interval BED file used was not sorted (i.e. chromosomal order), and when I fixed this part, the workflow succeeded without any error. Not completely sure whether this is the reason or underlying nature of the issue, but thought it might be worth mentioning.

sigven avatar Mar 15 '23 09:03 sigven

Hi @sigven ! Thanks for sharing. I guess, there could be multiple issues that result in the same error? We are also looking into "validating" the bed file beforehand. I just need to find a good tool. so hopefully in the future some of this can be caught early on

FriederikeHanssen avatar Mar 15 '23 09:03 FriederikeHanssen

I ran into the same issue when running a few tumor-only samples, using --tools mutect2,strelka. However, I realized that my interval BED file used was not sorted (i.e. chromosomal order), and when I fixed this part, the workflow succeeded without any error. Not completely sure whether this is the reason or underlying nature of the issue, but thought it might be worth mentioning.

This fixed it for me also.

cat intervals.bed | sort -k1,1V -k2,2n > intervals_sorted.bed for 'chr'-prefixed chromosome names

peterkilfeather avatar Mar 16 '23 14:03 peterkilfeather

@wt12318 c , Could you please explain how did you use and how did you enter back sarek pipeline? Thanks

paolo-kunderfranco avatar Mar 31 '23 06:03 paolo-kunderfranco

I have multiple of the same errors. Is there a way to solve it other than manually sorting the bed files?

pdeepak87 avatar Aug 25 '23 02:08 pdeepak87

I have multiple of the same errors. Is there a way to solve it other than manually sorting the bed files?

unfortunately not at the moment, but we're thinking of a solution

maxulysse avatar Aug 25 '23 08:08 maxulysse

I failed at reproducing the error. I ran

gatk --java-options "-Xmx5324M" CalculateContamination \
    --input sample2.mutect2.pileups.table \
    --output sample2.mutect2.contamination.table \
    --tmp-dir . \
    -tumor-segmentation sample2.mutect2.segmentation.table

with sample2.mutect2.pileups.table unsorted like this:

#<METADATA>SAMPLE=tumour
contig  position        ref_count       alt_count       other_alt_count allele_frequency
chr21   10569534        39      0       1       0.0777421
chr1    10559797        22      0       0       0.0836978
chr21   10605716        32      0       1       0.180505
...    

No problem running the GATK-CalculateContamination cmd using GATK v4.2.6.1, v4.3 or v4.4.

asp8200 avatar Aug 25 '23 09:08 asp8200