[Clarification] Why are alignments fractional?
Hello - thank you for engineering such a helpful tool.
I'm running through the tutorial and examining the file "gene_wise_quantifications_combined.tsv", which I understand contains genewise/transcript counts along with raw counts.
I notice that the count data is between 0 and 76.83, with plenty of fractional counts. In this pre-normalized data, why do fractional counts occur? What do they represent? I was under the impression that the gene/transcript-wise quantification at this step was simply counting (i.e. integer quantification) of segemehl alignments to a region.
Thank you for your help!
Per default each alignment of a read is normalized by the total number of alignments the given read has. Additionally each alignment is normalized by the number of features it overlaps with. Both normalization methods are turned on per default an will result in fractions. You can turn off these options via: --no_count_split_by_alignment_no --no_count_splitting_by_gene_no Further information can be found in the gene quantification documentation
We'll change the documentation to make this point more clear!
Ah, thank you so much for explaining!
My ultimate goal is to use the output of the gene quantification step (ideally) and feed it into a DESeq2-based downstream pipeline - we have an unusual experimental design and I'd like to try altering the design matrix beyond what we'd be able to do in READemption.
Is the split-multimapped-read count table "gene_wise_quantifications_combined.tsv" also the count matrix that is fed into DESeq2? (The source of my confusion was that I thought DESeq2 only accepted matrix integer inputs.)
And am I correct in understanding that --unique_only would count only the reads that are not multimapping?
Said another way, setting no_count_split_by_alignment_no (and no_count_splitting_by_gene_no) to true would mean that reads are double-counted (e.g. a read mapping to multiple genes is +1 for the count of each gene, or +0?), right? Not that they are thrown out?
Thank you very much!
"And am I correct in understanding that --unique_only would count only the reads that are not multimapping?" Yes! "Said another way, setting no_count_split_by_alignment_no (and no_count_splitting_by_gene_no) to true would mean that reads are double-counted (e.g. a read mapping to multiple genes is +1 for the count of each gene, or +0?), right? Not that they are thrown out?" Yes, you are right! "The source of my confusion was that I thought DESeq2 only accepted matrix integer inputs" I believe it also accepts floats as inputs, but maybe they are rounded by DESeq2
Hello! I had a follow-up to this -- when I ran this command: reademption gene_quanti --unique_only -p 4 --features gene,cds,region,exon --project_path READemption_analysis , I still produced an output with fractional reads (e.g., 547.33, 258.5). My understanding is that there should not be any fractional reads because the --unique_only option would eliminate multimapping reads, and therefore there are no reads to be split to produce the multimapping fractional counts. Could you help me understand what is happening here? (I am looking at the file "gene_wise_quantifications_combined.csv")
The fractions likely result from normalizing alignments by the number of features they overlap with. So if a read aligns to a single position, but the position is part of two features (e.g. gene and exon) each feature gets a count of 0.5. You can turn this option off by using the parameter --no_count_splitting_by_gene_no.