adam Spare not needed groupBy when calling toFragments() on AlignmentDataset

Hi!

I'm running a job that pre-processes a bunch of reads before aligning them with Bowtie. Most of the reads are unpaired, but toFragments() still runs a groupBy() on them, which does nothing useful in that case. Is there a way to skip this groupBy?

Looking at the code, I think we could add a flag that records when we know for sure that the input is unpaired. When we are unsure, we would do the groupBy anyway (or perhaps let the user tell us through a parameter to loadAlignments).
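
A rough sketch of the idea in plain Python, not ADAM code: reads are modelled as dictionaries, and both the `to_fragments` function and the `known_unpaired` flag are hypothetical names used only for illustration. With the flag set, every read becomes its own fragment in a single pass; without it, reads are grouped by name as happens today.

```python
from collections import defaultdict


def to_fragments(reads, known_unpaired=False):
    """Turn reads into fragments (reads that share a name).

    `known_unpaired` is a hypothetical flag, not an existing ADAM parameter.
    """
    if known_unpaired:
        # 1:1 conversion: each read is its own fragment, so no grouping is needed.
        return [{"name": r["name"], "reads": [r]} for r in reads]

    # Default path: group reads by name (the step that costs a shuffle in Spark).
    grouped = defaultdict(list)
    for r in reads:
        grouped[r["name"]].append(r)
    return [{"name": name, "reads": rs} for name, rs in grouped.items()]


unpaired = [{"name": "r1", "seq": "ACGT"}, {"name": "r2", "seq": "GGCA"}]
fast = to_fragments(unpaired, known_unpaired=True)
slow = to_fragments(unpaired)
```

For input that really is unpaired, both paths produce the same fragments, which is why the grouping is wasted work in that case.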

I'd love to implement it.

WDYT? Ben

Nov 07 '20 06:11 benraha

@heuermh Would love your thoughts on that before I implement it.

Nov 10 '20 17:11 benraha

If all you want is a straight 1:1 conversion of Alignment to Fragment, there are the transmute/transmuteDataFrame/transmuteDataset APIs, e.g.

https://javadoc.io/static/org.bdgenomics.adam/adam-core-spark3_2.12/0.32.0/org/bdgenomics/adam/rdd/read/AlignmentDataset.html#transmuteX,Y%3C:Product,Z%3C:org.bdgenomics.adam.rdd.GenomicDataset[X,Y,Z]:Z

Examples of this can be found in the unit tests: https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentDatasetSuite.scala#L126 and https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentDatasetSuite.scala#L1543

I think a new method toUnpairedFragments() that leaves out the groupBy might be ok.

Then for calling Bowtie, Cannoli has bowtie2, a function FragmentDataset → AlignmentDataset, and singleEndBowtie2, a function AlignmentDataset → AlignmentDataset. If you start from a mixed set of reads, you could filter out the unpaired reads and run them separately through singleEndBowtie2, so as not to incur the cost of toFragments, and then union the results together.
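
The split-and-union approach could look roughly like this. It is a plain-Python model, not Cannoli code: `align_mixed`, the `paired` key, and the two aligner callables are placeholders standing in for a fragment-based aligner and a read-based one.

```python
def align_mixed(reads, align_paired, align_single):
    """Route paired and unpaired reads through separate aligners, then union.

    `align_paired` stands in for a fragment-based aligner (grouping required),
    `align_single` for a read-based one (no grouping). Both are placeholders.
    """
    paired = [r for r in reads if r["paired"]]
    unpaired = [r for r in reads if not r["paired"]]
    # Only the paired subset pays for grouping into fragments.
    return align_paired(paired) + align_single(unpaired)


def tag(label):
    """Build a placeholder aligner that only records which path a read took."""
    return lambda rs: [dict(r, via=label) for r in rs]


reads = [
    {"name": "a", "paired": True},
    {"name": "a", "paired": True},
    {"name": "b", "paired": False},
]
result = align_mixed(reads, tag("paired"), tag("single"))
```

The benefit depends on the mix: the more unpaired reads there are, the more work bypasses the grouping step.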

There isn't currently a singleEndBowtie in Cannoli but I doubt it would be difficult to add one.

Nov 10 '20 18:11 heuermh

These are good, but I'd like to use the knowledge ADAM already has about the data instead of relying on the user to know it. Or is there some problem with this that I'm not aware of?

Something like this (taken from loadAlignments):

BAM -> unpaired
InterleavedFastQ -> paired
FASTQ -> paired / unpaired like ADAM works today
FASTA -> unpaired?
PARQUET -> can be paired
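
As a sketch, the list above could be encoded as a lookup that falls back to the groupBy whenever pairing is not known. This is plain Python with hypothetical names (`PAIRING_BY_FORMAT`, `needs_group_by`), and the values are only the guesses from the list, not verified properties of the formats.

```python
# True = paired, False = unpaired, None = unknown (decide per file, as today).
# These values mirror the proposal's guesses and are assumptions, not facts.
PAIRING_BY_FORMAT = {
    "bam": False,
    "interleaved_fastq": True,
    "fastq": None,
    "fasta": False,  # tentative, marked with "?" in the list
    "parquet": None,
}


def needs_group_by(fmt):
    """Skip the groupBy only when the format is assumed to be unpaired."""
    return PAIRING_BY_FORMAT.get(fmt) is not False
```

Any format missing from the table is treated as unknown, so the default stays on the safe side.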

Nov 10 '20 18:11 benraha

Those assumptions can fall apart, though. From experience, BAM/CRAM/SAM files can contain paired reads, unpaired reads, aligned reads, and unaligned reads. It is common to use unaligned BAM (uBAM) in workflows instead of FASTQ because it compresses better.

We would of course encourage the use of Parquet because it compresses better, doesn't have problems with split guessing, can take advantage of predicate pushdown and column projection, and can be read and written concurrently in a distributed fashion across a cluster. 😉

That said, please feel free to suggest changes!

Nov 10 '20 21:11 heuermh
