Bismark More alignments to sequence when included in whole genome than to the sequence alone

Hi Felix,

We have recently started analysing publicly-available single-end RRBS data from an organism we had never considered in the past.

We are mostly interested in the ribosomal DNA (rDNA), which due to its repetitive nature tends to not be (explicitly) included in published genome assemblies, so we always tweak those assemblies by explicitly including the rDNA consensus sequence as an additional contig. Unfortunately, although not explicitly included, full or partial rDNA unit sequences might appear unannotated in the assemblies nonetheless, particularly in unplaced contigs, so we try to fish for those sequences and mask them out.

For this organism, there appeared to be many more partial sequence collisions in the genome assembly than we are used to (according to Blastn). Nevertheless, we decided to mask out solely full unit clashes to start with and check whether the loss of coverage that those partial collisions could cause was bearable before attempting to mask every single of them all throughout the genome. To check this, we aligned the RRBS reads with bismark first to the rDNA unit alone (which could cause some spurious alignments from similar regions elsewhere in the genome) and then compare these alignments to those that ended being assigned to the rDNA unit contig on a WG+rDNA (partially-)masked assembly. We thus expected to observe a reduced number of rDNA-mapped reads in the latter approach, which is what happens in almost all cases, except two very clear outliers:

Those two don't make any sense to us whatsoever. With a more comprehensive assembly, the opportunities for multi-mapping should increase, and thus the number of reads reported as aligning to a particular contig should never increase compare to when there is no possibility of multi-mapping, right?

We have checked all reads that are reported as aligning to the rDNA on the WG+rDNA approach but are missing on the alignments to the rDNA alone, just in case there was any difference in their distribution, but the two cases appear to match quite well:

Cigar strings in both cases also seem similar.

Can you think of any possible explanation for how this could be happening, and if so, is there any way of mitigating it?

Cheers, Fran

Feb 28 '22 10:02 franrodalg

Hi Fran,

Admittedly, his got my thoughts in a bit of a twist! Just from a technical point, I assume you included the rDNA as an extra sequence/contig/chromosome and didn't stitch it right to the end of an existing chromosome? And secondly, with masking do you mean hard-masking (with NNNs), or soft-masking (with lower case letters)? I believe only hard-masking should have the desired effect.

Overall, I would agree that the whole scenario is somewhat counter-intuitive, and you would not expect more alignments when you add a contig that potentially results in multi-mapping. Nothing springs to my mind that would immediately explain this, but I would probably try the following next:

when you add the rDNA consensus sequence(s?), I would add NN on either side of the sequence. This is only because Bismark requires 2 additonal bp upstream and downstream (depending on the read orientation) to determine the cytosine context. In some extreme corner cases one could imagine that there are some sequences that would align well to the consensus sequence, but gets discarded because of this issue.
another option might be sequences that align to only partially to the consensus sequence, and read a bit further out into a a genomic stretch. These sequences would then align to a genomic place, but not to the consensus sequence itself. You should be able to test this hypothesis by using --local for the alignments - this should then effectively align to both sequences, and eliminate the alignments from the results.

I would probably also suggest to align all sequences to the the rDNA consensus on its own, while specifying the option --unmapped. Only unmapped reads could then be aligned to the genomic sequence (with or w/o masking), which should then results in holes in the coverage wherever you had rDNA reads. Such an rDNA first approach would then allow you to asses rDNA methylation levels on its own (if you are interested in that), and also allow you to easily tinker with the stringency for the rDNA mapping (e.g. --score_min L,0,-0.6, or even --local) before moving on to the genomic alignments.

It is very well possible that I got myself confused here, too, let me know if there is anything I can do to help.

Feb 28 '22 13:02 FelixKrueger

Thanks so much, Felix.

Indeed, we have appended the rDNA sequence as a separate contig, and hard masked the potentially conflicting regions of the genome with Ns.

I will try adding Ns at both sides of the rDNA sequence and run again.

This makes me wonder, though... if bismark finds a read that would partially map to two regions (e.g., first half to the genome region neighbouring the rDNA unit and the second half to the rDNA unit sequence itself), it would never report two alignments, right? I assume it would discard them both or keep only the "best" alignment. If this is the case, I don't really see how partial alignments could be causing an increase in the reads reported to belong to the rDNA. Or am I misunderstanding your point?

Feb 28 '22 14:02 franrodalg

Yes it is true, Bismark woudn't report alignments if you get different portions of the read aligning to different positions in the genome (such alignments would be not be considered concordant). So yes, I would agree that partial alignments should not per se lead to an increase in rDNA.

Maybe working with a subtractive method (using --unmapped) in combination with the unmasked and masked genome would help shed some on it (is there maybe a faint possibility that something went wrong during the masking process?).

Feb 28 '22 19:02 FelixKrueger

I used the same exact masking method we've used in the past for other organisms, and out of 134 samples only two appear to show this behaviour, so I don't have any reason to suspect the masked genome has any particular issue. But even if it did (such as having masked beyond what it was intended) I still don't see how reads that weren't aligning to the rDNA now would. It's really puzzling!

Anyway, many thanks for your suggestions, Felix. I will keep digging and will report back soon :)

On Mon, 28 Feb 2022, 19:23 Felix Krueger, @.***> wrote:

Yes it is true, Bismark woudn't report alignments if you get different portions of the read aligning to different positions in the genome (such alignments would be not be considered concordant). So yes, I would agree that partial alignments should not per se lead to an increase in rDNA.

Maybe working with a subtractive method (using --unmapped) in combination with the unmasked and masked genome would help shed some on it (is there maybe a faint possibility that something went wrong during the masking process?).

— Reply to this email directly, view it on GitHub https://github.com/FelixKrueger/Bismark/issues/474#issuecomment-1054585992, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZD5EWAADJBJ6RDQHBBEC3U5PDT5ANCNFSM5PQVSRDQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

Feb 28 '22 19:02 franrodalg

Thanks, this is quite intriguing. Hope you'll manage to spot what is is going!

Feb 28 '22 19:02 FelixKrueger

Bismark Bismark copied to clipboard

More alignments to sequence when included in whole genome than to the sequence alone

Bismark
Bismark copied to clipboard