fgbio
fgbio copied to clipboard
CallMolecularConsensusReads - collapsed UMI sequence contains 'N'
Hi, It seems that CallMolecularConsensusReads collapses the UMI sequence and assigns 'N' to UMI bases that are ambiguous based on the constituent reads.
I couldn't find where this was documented, but I'm curious if this behavior can be suppressed? Because if one wants to do GroupReadsByUmi again later, it doesn't work when there are Ns in the UMI sequence.
Thanks.
I think you want to run GroupReadsByUmi
first, then CallMolecularConsensusReads
. What am I missing? Feel free to re-open if this is still relevant.
I have a bit of a different situation than the usual one. My library structure provides sequence on both top and bottom strand in each individual flow cell cluster. And there are PCR duplicates. So I first use GroupReadsByUmi to group together all the PCR duplicates of the top strand sequence -> CallMolecularConsensusReads, and the same thing for the bottom strand. Then I want to call a duplex consensus, so I do GroupReadsByUmi again -> CallMolecularConsensusReads again for the duplex consensus.
So the second time I run GroupReadsByUmi, it is called after CallMolecularConsensusReads, which leads to the above problem where I lose groups due to 'N' bases in the UMI. The fix would be to give ambiguous UMI bases in GruopReadsByUmi the majority vote or based on base quality instead of 'N'.
Or as in my other post, would need some way to get the list of read names for the consensus of each individual strand, because the forward strand consensus and reverse strand consensus are coming from the same clusters with the same read names (except some might drop out due to quality, so it's not an exact match of read name sets). Not clear how to use MI tag to help this, as it would require collating all read names for each MI tag, and then comparing forward strand consensus read names to reverse strand consensus read names.
@gevro Ah, I think you want GroupReadsByUmi --strategy paired
, followed by CallDuplexConsensusReads
(see our wiki). I think this fits your use case.
Thanks, but that approach won't work unfortunately. GroupReadsByUmi --strategy --> CallDuplexConsensusReads generates two duplex consensus sequences from A1+B2 and A2+B1. However, my library structure has inserts smaller than the read length, so actually read 1 and read 2 fully overlap in each flow cell cluster. This is in addition to top and bottom strand data from each flow cell cluster.
What I would need is:
- Call consensus from all top strand reads: This will include R1 top strand and R2 top strand reads from all the clusters that belong to the UMI family.
- Call consensus from all bottom strand reads: This will include R1 bottom strand and R2 bottom strand reads from all the clusters that belong to the UMI family.
- Call duplex consensus from (1) and (2)
Is there any way to do that?
Unrelated, note there is a bit of an inconsistency in documentation (Does GroupReadsByUmi make /AB|BA or /A|B format?):
- GroupReadsByUmi: "The molecular IDs produced have more structure than for single UMI strategies, and are of the form {base}/{AB|BA}."
- CallDuplexConsensusReads: "Prior to running this tool, read must have been grouped with GroupReadsByUmi using the paired strategy. Doing so will apply (by default) MI tags to all reads of the form */A and */B "