READemption icon indicating copy to clipboard operation
READemption copied to clipboard

[Clarification] Why are alignments fractional?

Open kmuench opened this issue 2 years ago • 5 comments

Hello - thank you for engineering such a helpful tool.

I'm running through the tutorial and examining the file "gene_wise_quantifications_combined.tsv", which I understand contains genewise/transcript counts along with raw counts.

I notice that the count data is between 0 and 76.83, with plenty of fractional counts. In this pre-normalized data, why do fractional counts occur? What do they represent? I was under the impression that the gene/transcript-wise quantification at this step was simply counting (i.e. integer quantification) of segemehl alignments to a region.

Thank you for your help!

kmuench avatar Sep 11 '23 14:09 kmuench

Per default each alignment of a read is normalized by the total number of alignments the given read has. Additionally each alignment is normalized by the number of features it overlaps with. Both normalization methods are turned on per default an will result in fractions. You can turn off these options via: --no_count_split_by_alignment_no --no_count_splitting_by_gene_no Further information can be found in the gene quantification documentation

We'll change the documentation to make this point more clear!

Tillsa avatar Sep 12 '23 07:09 Tillsa

Ah, thank you so much for explaining!

My ultimate goal is to use the output of the gene quantification step (ideally) and feed it into a DESeq2-based downstream pipeline - we have an unusual experimental design and I'd like to try altering the design matrix beyond what we'd be able to do in READemption.

Is the split-multimapped-read count table "gene_wise_quantifications_combined.tsv" also the count matrix that is fed into DESeq2? (The source of my confusion was that I thought DESeq2 only accepted matrix integer inputs.)

And am I correct in understanding that --unique_only would count only the reads that are not multimapping?

Said another way, setting no_count_split_by_alignment_no (and no_count_splitting_by_gene_no) to true would mean that reads are double-counted (e.g. a read mapping to multiple genes is +1 for the count of each gene, or +0?), right? Not that they are thrown out?

Thank you very much!

kmuench avatar Sep 12 '23 12:09 kmuench

"And am I correct in understanding that --unique_only would count only the reads that are not multimapping?" Yes! "Said another way, setting no_count_split_by_alignment_no (and no_count_splitting_by_gene_no) to true would mean that reads are double-counted (e.g. a read mapping to multiple genes is +1 for the count of each gene, or +0?), right? Not that they are thrown out?" Yes, you are right! "The source of my confusion was that I thought DESeq2 only accepted matrix integer inputs" I believe it also accepts floats as inputs, but maybe they are rounded by DESeq2

Tillsa avatar Sep 13 '23 11:09 Tillsa

Hello! I had a follow-up to this -- when I ran this command: reademption gene_quanti --unique_only -p 4 --features gene,cds,region,exon --project_path READemption_analysis , I still produced an output with fractional reads (e.g., 547.33, 258.5). My understanding is that there should not be any fractional reads because the --unique_only option would eliminate multimapping reads, and therefore there are no reads to be split to produce the multimapping fractional counts. Could you help me understand what is happening here? (I am looking at the file "gene_wise_quantifications_combined.csv")

kmuench avatar Sep 28 '23 09:09 kmuench

The fractions likely result from normalizing alignments by the number of features they overlap with. So if a read aligns to a single position, but the position is part of two features (e.g. gene and exon) each feature gets a count of 0.5. You can turn this option off by using the parameter --no_count_splitting_by_gene_no.

Tillsa avatar Oct 10 '23 14:10 Tillsa