allow any variant caller to be run on a series of downsampled reads
For some analyses we're doing, I'd like to be able to run variant callers on randomly downsampled data.
An example analysis would be: run MuTect on subsamples of my BAM with {0.1, 0.2, ..., 1.0} fraction of the reads present, for a total of 10 runs. It would be useful if I could specify an arbitrary schedule of downsampling rates, potentially with duplicates, e.g.: {0.1, 0.1, 0.1, 0.5, 0.5, 0.5}, so that we can get replicates with the same fraction of reads sampled (but a different random sample each time).
The downsampling can be performed on a BAM with "samtools view". This tool is actually fairly clever in that it keeps mate pairs together (either both reads in a pair are sampled or neither). Here's the bash loop I've been using so far to generate the downsampled BAMs manually:
for i in $(seq 1 9); do
    samtools view -s 0.$i -b original.clean.dedup.recal.bam > subsampled-0.$i.bam
done
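According to samtools help, the integer part of the -s option specifies the random seed and the part after the decimal point specifies the fraction to keep. To support downsampling schedules with duplicates, we'd have to give each replicate a different seed; otherwise repeated fractions would produce the same sample.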
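One way the seeding could work is sketched below, as a dry run that only prints the commands it would execute. The function names (subsample_arg, plan_downsampling) and the output naming scheme (subsampled-FRACTION-repN.bam) are made up for illustration. The sketch assumes each fraction is written as 0.x and simply uses the position in the schedule as the seed. A fraction of 1.0 cannot be expressed in the fractional part of -s, so this sketch copies the original BAM in that case.

```bash
#!/usr/bin/env bash

# Build the value for "samtools view -s": "<seed>.<digits of the fraction>".
# Example: seed 3 and fraction 0.5 give 3.5.
# Assumes the fraction is written as 0.x (strictly below 1).
subsample_arg() {
    local seed=$1 fraction=$2
    printf '%s.%s\n' "$seed" "${fraction#*.}"
}

# Print one command per schedule entry (dry run, nothing is executed).
# Entry number i uses seed i, so repeated fractions get different seeds.
plan_downsampling() {
    local input=$1
    shift
    local seed=0 fraction
    for fraction in "$@"; do
        seed=$((seed + 1))
        case $fraction in
            1|1.0)
                # All reads kept: no subsampling needed.
                echo "cp $input subsampled-1.0-rep$seed.bam"
                ;;
            *)
                echo "samtools view -s $(subsample_arg "$seed" "$fraction") -b $input > subsampled-$fraction-rep$seed.bam"
                ;;
        esac
    done
}

plan_downsampling original.clean.dedup.recal.bam 0.1 0.1 0.1 0.5 0.5 0.5
```

The printed lines can be reviewed and then piped to bash to run them. Recording the seed in the file name keeps each replicate reproducible.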