adam
Avoid unnecessary groupBy when calling toFragments() on AlignmentDataset
Hi!
I'm running a process that pre-processes a bunch of reads before aligning them with Bowtie. Most of them are unpaired, so when I run toFragments(), they go through a groupBy() for no actual reason. Is there a way to skip this groupBy?
Looking at the code, I think we can add a flag that records when we know for sure that the input is unpaired. When we are unsure, we'd do the groupBy anyway (maybe letting the user tell us by adding a parameter to loadAlignments).
I'd love to implement it.
WDYT? Ben
@heuermh Would love your thoughts on that before I implement it.
If all you want to do is a straight 1:1 conversion of Alignment to Fragment, there are the transmute/transmuteDataFrame/transmuteDataset APIs, e.g.
https://javadoc.io/static/org.bdgenomics.adam/adam-core-spark3_2.12/0.32.0/org/bdgenomics/adam/rdd/read/AlignmentDataset.html#transmuteX,Y%3C:Product,Z%3C:org.bdgenomics.adam.rdd.GenomicDataset[X,Y,Z]:Z
An example of this can be found in the unit tests:
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentDatasetSuite.scala#L126
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentDatasetSuite.scala#L1543
I think a new method toUnpairedFragments() that leaves out the groupBy might be ok.
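To make the difference concrete, here is a minimal sketch on plain Scala collections rather than ADAM datasets. The Read and Frag case classes and both method bodies are hypothetical stand-ins for illustration only; they are not ADAM's Alignment/Fragment schemas or its actual toFragments implementation.

```scala
object FragmentSketch {
  // Hypothetical, simplified stand-ins for ADAM's Alignment and Fragment records.
  final case class Read(name: String, sequence: String, paired: Boolean)
  final case class Frag(name: String, reads: Seq[Read])

  // General case: reads that share a name are gathered into one fragment.
  // On a distributed dataset, this grouping step is what forces a shuffle.
  def toFragments(reads: Seq[Read]): Seq[Frag] =
    reads.groupBy(_.name).map { case (name, rs) => Frag(name, rs) }.toSeq

  // Known-unpaired case: every read becomes its own fragment, so a plain
  // 1:1 map is enough and no grouping is needed.
  def toUnpairedFragments(reads: Seq[Read]): Seq[Frag] =
    reads.map(r => Frag(r.name, Seq(r)))
}
```

For input where every read name is unique, both methods produce the same set of fragments; only the second avoids the grouping.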
Then for calling bowtie, in Cannoli we have bowtie2, a function FragmentDataset → AlignmentDataset, and singleEndBowtie2, a function AlignmentDataset → AlignmentDataset. If starting from a mixed set of reads, you could filter out the unpaired reads and run them separately through singleEndBowtie2, so as not to incur the cost of toFragments, and then union the results together.
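A rough sketch of that split-and-union flow, again on plain Scala collections. Everything here is a hypothetical placeholder: Read and Aligned are simplified records, and alignPaired / alignSingleEnd only stand in for the roles of a paired-end and a single-end aligner. None of it is Cannoli's actual API.

```scala
object SplitAndUnionSketch {
  // Hypothetical, simplified records; not ADAM's Alignment schema.
  final case class Read(name: String, sequence: String, paired: Boolean)
  final case class Aligned(name: String, aligner: String)

  // Placeholder for a fragment-based (paired-end) aligner:
  // mates have to be grouped by read name before alignment.
  def alignPaired(reads: Seq[Read]): Seq[Aligned] =
    reads.groupBy(_.name).toSeq.flatMap { case (_, mates) =>
      mates.map(m => Aligned(m.name, "paired"))
    }

  // Placeholder for a single-end aligner: reads are handled 1:1, no grouping.
  def alignSingleEnd(reads: Seq[Read]): Seq[Aligned] =
    reads.map(r => Aligned(r.name, "single-end"))

  // Split the mixed input, align each part separately, then union the results.
  def alignMixed(reads: Seq[Read]): Seq[Aligned] = {
    val (paired, unpaired) = reads.partition(_.paired)
    alignPaired(paired) ++ alignSingleEnd(unpaired)
  }
}
```

Only the paired subset pays for the grouping; the unpaired reads, which are the majority in the use case described here, skip it entirely.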
There isn't currently a singleEndBowtie in Cannoli but I doubt it would be difficult to add one.
These are good, but I want to use the knowledge ADAM already has about the data instead of relying on the user to know it. Or maybe there's some problem with this that I don't know of?
Something like this (taken from loadAlignments):
BAM -> unpaired
InterleavedFastQ -> paired
FASTQ -> paired / unpaired like ADAM works today
FASTA -> unpaired?
PARQUET -> can be paired
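In code, the idea could look roughly like the sketch below. The Pairedness type, the format keys, and both functions are hypothetical names for illustration and are not part of ADAM. FASTQ and Parquet are treated as unknown here because the list above allows either case for them.

```scala
object PairednessSketch {
  // What we believe about the input, based only on its format.
  sealed trait Pairedness
  case object Paired extends Pairedness
  case object Unpaired extends Pairedness
  case object Unknown extends Pairedness

  // Encodes the proposed mapping; the string keys are illustrative only.
  def fromFormat(format: String): Pairedness = format.toLowerCase match {
    case "bam"               => Unpaired
    case "interleaved_fastq" => Paired
    case "fasta"             => Unpaired // marked with "?" in the proposal
    case _                   => Unknown  // e.g. FASTQ or Parquet: may be either
  }

  // Skip the groupBy only when the input is known to be unpaired.
  def needsGroupBy(p: Pairedness): Boolean = p != Unpaired
}
```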
Those assumptions can fall apart, though. From experience, BAM/CRAM/SAM files can contain paired reads, unpaired reads, aligned reads, and unaligned reads. It is common to use unaligned BAM (uBAM) in workflows instead of FASTQ because it compresses better.
We would of course encourage the use of Parquet because it compresses better, doesn't have problems with split guessing, can take advantage of push down predicates and column projection, and can be read and written concurrently in a distributed fashion across a cluster. 😉
That said, please feel free to suggest changes!