PreprocessIntervals `--padding` has no effect
To reproduce:
Conda env with latest GATK:
# gatk.yaml
channels:
- conda-forge
- bioconda
dependencies:
- gatk4 ==4.6.2.0
Input file (ref.fa) with 64 positions:
>1
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
Create basic files:
micromamba create -f gatk.yaml -n gatk-v4.6.2.0
micromamba activate gatk-v4.6.2.0
samtools faidx ref.fa
awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, 0, $2}' ref.fa.fai > ref.bed
gatk CreateSequenceDictionary \
-R ref.fa \
-O ref.dict
gatk BedToIntervalList \
-I ref.bed \
-SD ref.dict \
-O ref.interval_list
Problematic call:
BIN_LENGTH=10
PADDING=5
gatk PreprocessIntervals \
-R ref.fa \
-L ref.interval_list \
--bin-length $BIN_LENGTH \
--padding $PADDING \
--interval-merging-rule OVERLAPPING_ONLY \
-O ref.shard_list
Output:
@HD VN:1.6
@SQ SN:1 LN:64 M5:ff68171411c912714ddcaee815380130 UR:file:///.../ref.fa
1 1 10 + .
1 11 20 + .
1 21 30 + .
1 31 40 + .
1 41 50 + .
1 51 60 + .
1 61 64 + .
The bins have the correct length, but no padding is applied. This looks like a significant bug: using these intervals as input to the genotyping steps might produce wrong variant calls downstream.
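To make the observation easier to verify on larger interval lists, here is a minimal sketch that prints the width of each bin and how many bases it shares with the previous bin. `check_bins` is a hypothetical helper written for this report, not part of GATK, and it assumes that padded bins would show up as longer or overlapping intervals.

```shell
# check_bins: hypothetical helper, not part of GATK.
# Prints contig, start, end, width, and overlap with the previous bin.
# Usage: check_bins ref.shard_list
check_bins() {
  awk 'BEGIN { OFS = "\t" }
       /^@/ { next }                  # skip the @HD/@SQ header lines
       {
         width = $3 - $2 + 1          # interval_list coordinates are 1-based, inclusive
         overlap = 0
         if ($1 == prev_contig && $2 <= prev_end) {
           overlap = prev_end - $2 + 1
         }
         print $1, $2, $3, width, overlap
         prev_contig = $1
         prev_end = $3
       }' "$1"
}
```

On the output above, this reports a width of 10 for the first six bins, 4 for the last one, and an overlap of 0 for every bin, so no interval has been extended in either direction.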
Furthermore, the command above has to include the option `--interval-merging-rule OVERLAPPING_ONLY`; otherwise it fails with:
java.lang.IllegalArgumentException: Interval merging rule must be set to OVERLAPPING_ONLY.
This is odd. What is the point of an option that has to be set to exactly one value? That is not an intuitive user interface, and the option could simply be left out.