RNA-Bloom icon indicating copy to clipboard operation
RNA-Bloom copied to clipboard

Coverage does not add up

Open schorlton opened this issue 2 years ago • 4 comments

Thanks again for your help!

When using RNA-Bloom, I expected (perhaps naively) that the sum of coverage of each contig X length of same contig would not exceed the number of bases in my input dataset. Or at least would be close, accounting for some margin of error. (Ie reads would only contribute to coverage of a single transcript, as they only originated from one transcript).

However, this does not seem to be the case.

Here is my calculation of input bases from the transcripts FASTA: assembly_info.xls Sum of coverage of each contig X length of same contig = 354,613,891bp

assembly_size# seqkit stats input.fastq 
file            format  type  num_seqs      sum_len  min_len  avg_len  max_len
input.fastq  FASTQ   DNA    415,469  200,885,818      100    483.5    4,601

So: is a single read having its bases assigned to multiple transcripts, and therefore increasing the coverage of multiple transcripts?

Please report

Command and version is the same as in this issue: https://github.com/bcgsc/RNA-Bloom/issues/18

schorlton avatar Oct 26 '21 14:10 schorlton

The sum isn't the same because each base of a given read can be assigned to more than one contig, especially when it aligns to the overlap between two contigs.

For example, the +++++ region of the following read would be double-counted:

           read
          -------+++++--------
======================
                 ================
   contig 1             contig 2

This double-counted region could be fractionally assigned to the two contigs so that the c value can be more reflective of the "true" read count. It is definitely something for me to work on in the future, but it is not a priority for me at the moment.

kmnip avatar Oct 26 '21 17:10 kmnip

Thanks. Would this situation only arise if we fail to assemble a full transcript? With the above diagram, I'd expect it to optimally merge these contigs. I can see how if there was only a single read bridging these transcripts, that may be weak evidence to merge them, but I'm wondering when else you would encounter this situation. The double counting of bases seems to be quite significant for the number of transcript overlaps I expected.

schorlton avatar Oct 26 '21 17:10 schorlton

In the above diagram, contigs 1 and 2 can still be join together into a transcript if there is sufficient overlap at the contig edges. Double-counting can also happen when reads map to more than one target sequence. For example, shorter reads representing shared exons in isoforms would likely multi-map.

kmnip avatar Oct 26 '21 17:10 kmnip

Oh yeah, that makes sense. Then you get into the purpose of RNA-seq quantification tools. I suppose you're right, fractional coverage would be one solution to properly calculate coverage. Thanks for your insight and I hope you can bump this up your priority list just a bit!

schorlton avatar Oct 26 '21 18:10 schorlton