
PreprocessIntervals `--padding` has no effect

Open lczech opened this issue 5 months ago • 0 comments

To reproduce:

Conda env with latest GATK:

# gatk.yaml
channels:
  - conda-forge
  - bioconda
dependencies:
  - gatk4 ==4.6.2.0

Input file (ref.fa) with 64 positions:

>1
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT

Create basic files:

micromamba create -f gatk.yaml -n gatk-v4.6.2.0
micromamba activate gatk-v4.6.2.0

samtools faidx ref.fa
awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, 0, $2}' ref.fa.fai > ref.bed

gatk CreateSequenceDictionary \
    -R ref.fa \
    -O ref.dict

gatk BedToIntervalList \
    -I ref.bed \
    -SD ref.dict \
    -O ref.interval_list

Problematic call:

BIN_LENGTH=10
PADDING=5

gatk PreprocessIntervals \
    -R ref.fa \
    -L ref.interval_list \
    --bin-length $BIN_LENGTH \
    --padding $PADDING \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O ref.shard_list

Output:

@HD	VN:1.6
@SQ	SN:1	LN:64	M5:ff68171411c912714ddcaee815380130	UR:file:///.../ref.fa
1	1	10	+	.
1	11	20	+	.
1	21	30	+	.
1	31	40	+	.
1	41	50	+	.
1	51	60	+	.
1	61	64	+	.

The bins have the correct length, but no padding is applied. This looks like an important bug, since using these intervals as input to the genotyping steps might produce wrong variant calls downstream.
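To make the "no padding" observation easier to check on other inputs, here is a rough sketch of a helper that prints the length of each output interval and how many bases it shares with the previous interval on the same contig. The function name `summarize_intervals` is made up for this issue and is not part of GATK; it only assumes a tab-separated interval list with `@` header lines and 1-based, inclusive coordinates.

```shell
# Hypothetical helper (not part of GATK): for each interval, print
# contig, start, end, length, and overlap with the previous interval.
summarize_intervals() {
    awk 'BEGIN { FS = "\t"; OFS = "\t" }
         /^@/ { next }                               # skip @HD / @SQ header lines
         {
             len = $3 - $2 + 1                       # coordinates are 1-based, inclusive
             overlap = 0
             # compare contig names as strings, coordinates as numbers
             if (have_prev && ($1 "") == prev_chr && ($2 + 0) <= prev_end)
                 overlap = prev_end - $2 + 1
             print $1, $2, $3, len, overlap
             prev_chr = $1 ""; prev_end = $3 + 0; have_prev = 1
         }' "$1"
}
```

Running `summarize_intervals ref.shard_list` on the output above reports lengths 10, 10, 10, 10, 10, 10, 4 and an overlap of 0 on every line, so the bins are exactly adjacent and none of them is extended.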

Furthermore, the above command has to include the option `--interval-merging-rule OVERLAPPING_ONLY`; otherwise I get:

java.lang.IllegalArgumentException: Interval merging rule must be set to OVERLAPPING_ONLY.

This is odd. What is the point of an option that has to be set to exactly one value? That is not an intuitive user interface, and the option could simply be left out instead.

lczech, May 05 '25 15:05