
Wish list - automatic merging and/or concatenation.

Open jkbonfield opened this issue 3 years ago • 0 comments

We have the BCF synced reader code, which can merge (and dedup / intersect) multiple BCF files on the fly.

I'm thinking there is room for a similar action in the SAM (et al) world. Specifically:

  • Open a file referencing other files; manifest, fofn, whatever we wish to call it.
  • Specify files to be either concatenated to previous in the list or merged with previous in the list.
  • Specify data filtering rules.

Example concatenation use case.

  1. Run some data-processing pipeline (eg local realignment) in a per-chromosome fashion.
  2. Then read the data without having to do a "samtools cat" to physically merge back into a single file.
  3. Run some analysis on the reconstituted file.

Example merge use case.

  1. Produce a bunch of BAMs per sample.
  2. Merge BAMs into a single file
  3. Run a joint caller on the multi-sample BAM.

In both examples steps 2 and 3 could be done together. They are solved right now via the use of pipes, so it's not critical, but this may make life simpler.

Note I'm not advocating replacing samtools cat and samtools merge. They have extra specialisms and I think inside htslib this should only be the most mundane and simplest of functions - ones where the SAM headers must match and there's no "magic" to do. If you like making lots of files each with different headers, then you're not looking for an easy solution ;-)
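To make the two modes concrete, here is a minimal Python sketch of the difference between concatenation and merging, plus the "headers must match" restriction. It is not htslib code: records are stand-in `(ref_id, pos, name)` tuples, headers are plain strings, and the function names are hypothetical.

```python
import heapq
from itertools import chain

# Stand-in record: (ref_id, pos, name). Real code would use alignment
# records read via htslib; these tuples are an assumption for illustration.

def check_headers(headers):
    """Return the shared header, or fail if any input header differs."""
    first = headers[0]
    for other in headers[1:]:
        if other != first:
            raise ValueError("headers differ; this simple mode cannot combine them")
    return first

def concatenate(streams):
    """Yield every record of stream 1, then stream 2, and so on."""
    return chain.from_iterable(streams)

def merge(streams):
    """Interleave coordinate-sorted streams into one coordinate-sorted stream."""
    return heapq.merge(*streams, key=lambda rec: (rec[0], rec[1]))

# Concatenation: per-chromosome files read back to back.
chr1 = [(0, 100, "a"), (0, 200, "b")]
chr2 = [(1, 50, "c")]
concatenated = list(concatenate([chr1, chr2]))

# Merge: per-sample files covering the same region, interleaved by position.
samp1 = [(0, 100, "a"), (0, 300, "c")]
samp2 = [(0, 200, "b")]
merged = list(merge([samp1, samp2]))
```

Concatenation never compares records, so it only gives sorted output if the inputs are already listed in reference order. Merging needs each input to be sorted by coordinate.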

What could the input file look like? Many possibilities. Some examples:

chr1_samp1.bam
chr1_samp2.bam merge
chr1_samp3.bam merge
chr2_samp1.bam
chr2_samp2.bam merge
chr2_samp3.bam merge

This is a fofn that defaults to file concatenation, but where a "merge" keyword indicates that the file is merged into the previous one (like "squash" vs "pick" on git commits during a rebase). So this concatenates chr1 with chr2 while merging the samp[123] files together.
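As a rough illustration of how this line-based form could be read, the Python sketch below turns it into a list of merge groups that are then concatenated in order. The function name `parse_fofn` and the error handling are assumptions, and filenames containing whitespace are not handled.

```python
def parse_fofn(text):
    """Parse 'filename [merge]' lines into a list of merge groups.

    Each group is a list of files to merge together; the groups
    themselves are concatenated in the order they appear.
    """
    groups = []
    for lineno, line in enumerate(text.splitlines(), 1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name, *rest = fields
        if not rest:
            groups.append([name])  # default: start a new concatenated unit
        elif rest == ["merge"]:
            if not groups:
                raise ValueError(f"line {lineno}: no previous file to merge into")
            groups[-1].append(name)  # fold into the previous unit
        else:
            raise ValueError(f"line {lineno}: unexpected text {rest!r}")
    return groups

example = """\
chr1_samp1.bam
chr1_samp2.bam merge
chr1_samp3.bam merge
chr2_samp1.bam
chr2_samp2.bam merge
chr2_samp3.bam merge
"""
groups = parse_fofn(example)
```

With the example above, `groups` holds two lists of three files each: one per chromosome.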

[ chr1_samp1.bam chr1_samp2.bam chr1_samp3.bam ] [ chr2_samp1.bam chr2_samp2.bam chr2_samp3.bam ]

If we wish to merge files we enclose them in square brackets; otherwise each file is just whitespace separated and assumed to be concatenated. This, I think, is my favoured one. It still means the naive traditional "fofn" just works as concatenation.
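A minimal Python sketch of a tokenizer for the bracketed form might look like this. The name `parse_manifest` is hypothetical, nesting is rejected, and the sketch assumes filenames contain neither whitespace nor square brackets.

```python
def parse_manifest(text):
    """Parse whitespace-separated filenames with [ ... ] merge groups.

    Returns a list of groups: a bare filename becomes a one-file group,
    a bracketed run becomes one multi-file merge group. Groups are
    concatenated in order.
    """
    groups = []
    current = None  # the group being filled while inside [ ... ]
    # Pad brackets with spaces so "[a.bam" and "a.bam]" also tokenize.
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    for tok in tokens:
        if tok == "[":
            if current is not None:
                raise ValueError("nested '[' is not supported")
            current = []
        elif tok == "]":
            if current is None:
                raise ValueError("']' without a matching '['")
            if not current:
                raise ValueError("empty merge group")
            groups.append(current)
            current = None
        elif current is not None:
            current.append(tok)    # inside brackets: part of a merge group
        else:
            groups.append([tok])   # outside brackets: plain concatenation
    if current is not None:
        raise ValueError("unterminated '['")
    return groups

bracketed = parse_manifest(
    "[ chr1_samp1.bam chr1_samp2.bam chr1_samp3.bam ] "
    "[ chr2_samp1.bam chr2_samp2.bam chr2_samp3.bam ]"
)
plain = parse_manifest("chr1.bam\nchr2.bam\n")
```

A traditional fofn with no brackets yields only single-file groups, which is pure concatenation.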

Or json style:

{
  "merge": [
    "chr1_samp1.bam",
    "chr1_samp2.bam",
    "chr1_samp3.bam"
  ],
  "merge": [
    "chr2_samp1.bam",
    "chr2_samp2.bam",
    "chr2_samp3.bam"
  ]
}
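One practical caveat with this layout, offered as an observation rather than part of the proposal: the object repeats the "merge" key, and many JSON parsers keep only the last value for a duplicated name. Python's standard `json` module behaves this way by default. The sketch below shows that behaviour and how `object_pairs_hook` can recover both groups. The shortened file lists are just for illustration.

```python
import json

doc = """
{
  "merge": ["chr1_samp1.bam", "chr1_samp2.bam"],
  "merge": ["chr2_samp1.bam", "chr2_samp2.bam"]
}
"""

# Default parsing builds a dict, so the second "merge" replaces the first.
as_dict = json.loads(doc)

# object_pairs_hook receives every (key, value) pair in document order,
# so repeated keys survive if we keep the pairs as a list.
pairs = json.loads(doc, object_pairs_hook=lambda kv: kv)
groups = [files for key, files in pairs if key == "merge"]
```

An array of arrays, such as `[["a.bam", "b.bam"], ["c.bam", "d.bam"]]`, would avoid the issue entirely. That shape is a hypothetical alternative, not something proposed above.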

Yet more syntaxes could be invented, eg #includes, etc. Ideas welcomed, but I envisage supporting one only. It's the concept right now that I think is important to discuss.

The expectation is we'd just do a samtools view bams.fofn (or bams.json?) and it'd simply Do The Right Thing (TM).

It could be further extended to support on-the-fly filtering. Eg say we want to merge samp1.bam and samp2.bam with secondaries removed, maybe something like:

[
    samp1.bam {flag != secondary}
    samp2.bam {flag != secondary}
]
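To show what evaluating such a rule could involve, here is a small Python sketch that parses one entry and builds a predicate over the SAM FLAG field. The grammar is only guessed from the example above. Reading `flag != secondary` as "the secondary bit (0x100) is not set" is an assumption, and `parse_entry` and `FLAG_BITS` are hypothetical names.

```python
import re

# SAM FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    "unmapped": 0x4,
    "secondary": 0x100,
    "duplicate": 0x400,
    "supplementary": 0x800,
}

# filename, optionally followed by {flag != NAME} or {flag == NAME}
ENTRY = re.compile(r"^\s*(\S+)\s*(?:\{\s*flag\s*(!=|==)\s*(\w+)\s*\})?\s*$")

def parse_entry(line):
    """Return (filename, keep) where keep(flag) says whether to emit a record."""
    m = ENTRY.match(line)
    if m is None:
        raise ValueError(f"cannot parse entry: {line!r}")
    name, op, flag_name = m.groups()
    if op is None:
        return name, lambda flag: True          # no rule: keep everything
    if flag_name not in FLAG_BITS:
        raise ValueError(f"unknown flag name: {flag_name!r}")
    bit = FLAG_BITS[flag_name]
    if op == "!=":
        return name, lambda flag: not (flag & bit)   # keep if bit is clear
    return name, lambda flag: bool(flag & bit)       # keep if bit is set

name, keep = parse_entry("    samp1.bam {flag != secondary}")
flags = [0, 99, 0x100, 0x110]
kept = [f for f in flags if keep(f)]
```

The predicate is applied per record as it is read, so it works the same way whether the record came from a sequential scan or an indexed region query.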

An obvious use here is avoiding having both foo.bam and foo.rmdup.bam on disk at the same time, while keeping the filtering instructions written down alongside the data rather than relying on someone knowing that filtering is needed. Note that filtering this way is also an easy win over piping, as we can still index the file and do random access on it while filtering.

Thoughts?

jkbonfield, Nov 03 '20 12:11