Generate MI tags [feature request]
I've aligned reads to a de novo assembly using BWA-MEM rather than Long Ranger align. The BWA alignments have the BX tag but no MI tag. Would you consider implementing a module to group together nearby reads into molecules and adding MI:i tags? No worries at all if you consider this feature outside the scope of bxtools.
lariat (aka longranger align) groups together reads that are within 50 kbp of each other. That's a reasonable default value, but I'd find it very helpful for this to be a configurable parameter. I'd like to set it to something more conservative like perhaps 5 kbp.
Any thoughts on this feature request? My motivation is that longranger wgs takes ~3 days to run (for this particular data set), and bwa mem takes ~1.5 hours to run (on the same data set), so I often run BWA first while longranger wgs is running. I'd like to calculate some stats on molecule sizes using bxtools mol, which needs MI:i tags.
Hi Shaun, apologies a bit slow here. I think this is reasonable, and I started on this a few weeks back but got called away by other things. If I understand correctly, this would basically run along the BAM and label BX tags into an MI (or something else maybe) tag if they are within X bp of each other. I suppose then that right after X bp, any other BX of the same value would then be assigned a new MI tag up until the reader has progressed another X bp.
Is that the design you were thinking of?
No apologies needed. Thanks for your response, Jeremiah. Yep, that's what I'm thinking. It should work on a BAM file sorted by target position. It can keep a priority queue of all the barcodes that are current (seen within the last X bp) and dump them as they expire, and dump them when transitioning to a new reference sequence. The default value for X for Lariat is 50 kbp. I think it's perhaps too large, but it's a reasonable default to match behaviour. The default tag should be MI but configurable by the user. If the MI tag already exists, perhaps it would be useful to move that tag to a different tag (XM perhaps), but that's definitely a third-teer feature request.
Write now I'm converting BAM to TSV and doing this in R. Here's the code: https://github.com/sjackman/picea-glauca-organelles/blob/master/molecules.rmd#group-reads-into-molecules
This feature is now implemented in the utility tigmint-molecule. See https://github.com/bcgsc/tigmint
The input BAM file must be sorted by barcode and position using samtools sort -tBX.