
Calculate the duplication rate/metrics for non-duplex BAM files

Open angadps opened this issue 4 years ago • 2 comments

Dear authors,

This is a very basic operation, and apologies if the functionality is already available, but I couldn't find it in either the tool documentation or this GitHub page.

I'm looking for sequencing metrics such as duplication rate, total read count, etc., much like those Picard reports for non-UMI BAMs. The only metrics tool I can find expects paired UMIs from duplex reads (I'm not running duplex).

I would run Picard itself, but the fgbio tools retain only unique reads in the consensus BAM. There is no intermediate BAM file with duplicate read tags to read from.

My workaround for now is to get family size counts when running GroupReadsByUmi and estimate the duplication rate from them. Ideally, I would like a metrics tool that generates various metrics by default (perhaps on the grouped.bam?).
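For reference, the workaround can be sketched roughly as follows. This is a minimal sketch, not an fgbio feature: it assumes the family-size histogram is a tab-separated file with `family_size` and `count` columns, and it treats every read beyond the first in each family as a duplicate. The function name `duplication_rate` and the example numbers are made up for illustration.

```python
import csv
import io


def duplication_rate(histogram_tsv):
    """Estimate a duplication rate from a family-size histogram.

    Assumption: the input is tab-separated with a header row containing
    `family_size` and `count` columns. Each family is treated as one
    unique molecule, so the estimate is 1 - families / total reads.
    """
    families = 0
    reads = 0
    for row in csv.DictReader(histogram_tsv, delimiter="\t"):
        size = int(row["family_size"])
        count = int(row["count"])
        families += count          # one unique molecule per family
        reads += size * count      # reads contributed by these families
    if reads == 0:
        return 0.0
    return 1.0 - families / reads


# Made-up histogram: 50 singletons, 30 families of size 2, 20 of size 3.
example = io.StringIO(
    "family_size\tcount\n"
    "1\t50\n"
    "2\t30\n"
    "3\t20\n"
)
rate = duplication_rate(example)
print(f"{rate:.3f}")  # 100 families from 170 reads -> 0.412
```

Note that this only covers reads that made it into the grouping step, so reads filtered out earlier are not reflected in the estimate.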

Please let me know your thoughts, and apologies again if a solution already exists and I missed it.

Thanks, --Angad.

angadps avatar Mar 16 '20 15:03 angadps

@angadps this seems like a completely reasonable request. You can also take a look at the auxiliary tags in the consensus BAM, which give you the number of raw reads used to build a consensus, but that's not always accurate, as some reads are filtered beforehand. I am going to label this issue as help wanted unless you want to help sponsor this project.

nh13 avatar Apr 13 '20 17:04 nh13

@nh13 Thanks for the suggestion. Let me first review the numbers (the cD tag?) for my datasets to understand them better. Alternatively, the "-f" switch in GroupReadsByUmi supposedly writes summary metrics for something like that? However, the numbers don't seem to add up, and I haven't been able to find clear documentation on what exactly they capture. Are they not the number of read families/molecules for each cD value in the BAM file? Thank you.

angadps avatar Apr 22 '20 12:04 angadps