[feature] Pileup Engine

Open sstadick opened this issue 3 years ago • 3 comments

This is a first draft of a pileup engine using noodles for the underlying IO.

Is this something you would be interested in having in the noodles family? If you want to keep noodles more on the IO side, I totally understand, and I'll move this into its own library. I just wanted to get your thoughts before I start adding examples / docs / tests. (This is not the final PR, just a draft; more work would be done to get this code on par with the rest of noodles.)

The impetus for me is basically to skip the middleman in htslib and have the pileup engine supply more information up front and not have to re-analyze the stack of reads for this tool: https://github.com/sstadick/perbase. The implementation below is very much based on sambamba's impl, which does provide more info up front.

Any and all feedback is welcome :+1:

BTW, noodles is a fantastic set of libraries, thank you for doing this in the open!

sstadick avatar Apr 18 '21 20:04 sstadick

Nice, this is a great initiative!

Let's include something like this. Even though, as you mentioned, it's more algorithmic than I/O, pileup is a fairly common operation and is likely to be expected from an alignments reader.

zaeleus avatar Apr 21 '21 00:04 zaeleus

Awesome! I'll get it into PR-worthy shape then, smooth out the rough edges, and finish off all the TODOs at the top of pileup.rs.

Once that's in place I'll convert from a draft and we can iterate from there :+1:

sstadick avatar Apr 21 '21 02:04 sstadick

I'm still going to come back to this and have not forgotten about it, life has just not conspired to make time for piling up reads lately 👍

sstadick avatar Jun 15 '21 16:06 sstadick

I'd be interested in using something like this if it were in noodles.

brentp avatar May 23 '23 17:05 brentp

I added a simple pileup iterator in acd49bb625bcc23194590006f28a76064923001f that currently just calculates column depths. It piles records over an adaptive window on the reference sequence and is optimized for low latency, i.e., it emits columns immediately after they are guaranteed to no longer be affected by future records. This implies that it only works with coordinate-sorted data. It doesn't include all the counts as in this patch, but it can be iterated and built upon.
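For readers following along, here is a minimal sketch of the idea described above, written against plain intervals rather than noodles records. It is not the code from the commit: `Interval` and `DepthPileup` are hypothetical names, positions are assumed to be 1-based and inclusive, and CIGAR operations, filters, and multiple reference sequences are ignored. The window grows to cover each incoming record, and a column is emitted as soon as the next record starts past it.

```rust
use std::collections::VecDeque;
use std::iter::Peekable;

/// Hypothetical stand-in for an alignment record: a 1-based, inclusive
/// interval on a single reference sequence. This is not a noodles type.
#[derive(Clone, Copy, Debug)]
struct Interval {
    start: usize,
    end: usize,
}

/// Streams `(position, depth)` columns from coordinate-sorted intervals.
struct DepthPileup<I: Iterator<Item = Interval>> {
    records: Peekable<I>,
    /// Depths for the positions `window_start..window_start + window.len()`.
    window: VecDeque<u32>,
    /// Reference position of `window[0]`.
    window_start: usize,
}

impl<I: Iterator<Item = Interval>> DepthPileup<I> {
    fn new(records: I) -> Self {
        Self {
            records: records.peekable(),
            window: VecDeque::new(),
            window_start: 0,
        }
    }
}

impl<I: Iterator<Item = Interval>> Iterator for DepthPileup<I> {
    type Item = (usize, u32);

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // The front column is final once the next record starts after it,
            // or once the input is exhausted.
            let next_start = self.records.peek().map(|r| r.start);

            if let Some(&depth) = self.window.front() {
                if next_start.map_or(true, |start| start > self.window_start) {
                    self.window.pop_front();
                    let position = self.window_start;
                    self.window_start += 1;
                    return Some((position, depth));
                }
            }

            // Otherwise, pile the next record onto the window.
            let record = self.records.next()?;
            assert!(record.start <= record.end, "interval must not be empty");
            assert!(
                record.start >= self.window_start,
                "input must be coordinate-sorted"
            );

            if self.window.is_empty() {
                // Skip uncovered positions by restarting the window here.
                self.window_start = record.start;
            }

            let first = record.start - self.window_start;
            let last = record.end - self.window_start;

            // Grow the window so it covers the end of this record.
            if self.window.len() <= last {
                self.window.resize(last + 1, 0);
            }

            for depth in self.window.iter_mut().skip(first).take(last - first + 1) {
                *depth += 1;
            }
        }
    }
}
```

With the intervals 1-5, 3-7, and 10-10, this sketch yields depth 1 at positions 1-2, depth 2 at 3-5, depth 1 at 6-7, and depth 1 at 10; positions 8-9 are skipped because nothing covers them. The assertion on `record.start` is where the coordinate-sorted requirement shows up: an out-of-order record would need a column that has already been emitted.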

Thanks to @sstadick for the initial implementation and inspiration.

zaeleus avatar Jun 14 '23 19:06 zaeleus

Thanks @zaeleus. A really clean implementation. I'll look into expanding this in the future.

brentp avatar Jun 19 '23 23:06 brentp