avocado
avocado copied to clipboard
Per position allele count reporting utility ( like bam-readcount)
Looking for general comments about utility and approach - not ready for final review yet.
This PR adds functionality to produce a per-position report of counts of reads reporting an A/C/G/T/N or ins/del at each genomic position covered by at least one read in an input alignmentRecordRDD
The functionality is intended to mirror that of the bam-readcount program https://github.com/genome/bam-readcount which is useful to get such per-position stats for use in downstream applications such as testing new variant calling models or evaluating background sequencing noise at a given position.
The approach of this implementation is to perform a alignmentRecordRDD.mapPartitions, in which each partition builds in memory a hash of type mutable.Map[ReferencePosition, AlleleCounts] which is then returned as RDD[ReferencePosition, AlleleCounts], then combined across partitions by reduceByKey adding counts at positions, to resolve partition overlapping reads. This per-partition approach seemed efficient given the size of the hash is limited to the fraction of the genome in a partition ( which can be small if genomic pos sorted), and avoids a flatmap that would produce a large RDD with every base of every read as elements.
The CIGAR/MD reading algorithm was adapted from existing DiscoverVariants, but the context of reporting every covered position and use of the mapPartitions is sufficiently different that I think a new function/object makes sense rather than adding options to DiscoverVariants
- Before finishing up this PR I wanted to get feedback on:
- Just checking that this user facing per-position summary function doesn't already exist in Avocado or elsewhere in ADAM?
- Is a command line utility in Avocado a reasonable place for this functionality to live?
Todo:
- further validate, compare against Bam-readcount output, add tests.
Can one of the admins verify this patch?
+1! This looks like a great start. I'll make more of a pass over this later this week. Thanks for the first cut @jpdna!
Jenkins, add to whitelist.
Coverage decreased (-3.2%) to 76.705% when pulling d6137af9ec741a2671f78f0e269ad5cb315ddf87 on jpdna:posAlleleStats into e0979dd11a2fd0fd8714cd062b9e7ee835a265e5 on bigdatagenomics:master.
Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/avocado-prb/232/