Documentation - more explanation of percent_mapped and metadata information please
Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
Both, but salmon / bulk mode in this instance
Describe the bug
Salmon appears to define mapping percentage differently from other programs. I'm assuming this based on this answer.
Basically, I want an explanation that I can give to my collaborators that can demonstrate why Salmon is still producing good results at lower mapping rates (assuming that is the case).
Given that alignment fraction is a common QC metric for read mapping, I think it is important that this is clarified. For example, the run I'm currently looking at has percent_mapped rates of 37-56%:
$ grep percent mapped/salmon_AG*/aux_info/meta*.json
mapped/salmon_AG1232_001_MR1/aux_info/meta_info.json: "percent_mapped": 37.28370866811582,
mapped/salmon_AG1232_002_MR2/aux_info/meta_info.json: "percent_mapped": 53.455500538891637,
mapped/salmon_AG1232_003_MR3/aux_info/meta_info.json: "percent_mapped": 56.060426412206258,
mapped/salmon_AG1232_004_MI1/aux_info/meta_info.json: "percent_mapped": 50.16684584515427,
mapped/salmon_AG1232_005_MI2/aux_info/meta_info.json: "percent_mapped": 47.96459794285113,
mapped/salmon_AG1232_006_MI3/aux_info/meta_info.json: "percent_mapped": 42.43504948857183,
mapped/salmon_AG1232_007_MG1/aux_info/meta_info.json: "percent_mapped": 43.76251592563446,
mapped/salmon_AG1232_008_MG2/aux_info/meta_info.json: "percent_mapped": 46.5666026699582,
mapped/salmon_AG1232_009_MG3/aux_info/meta_info.json: "percent_mapped": 39.4875646010278,
mapped/salmon_AG1232_010_FR1/aux_info/meta_info.json: "percent_mapped": 44.95951527456434,
mapped/salmon_AG1232_011_FR2/aux_info/meta_info.json: "percent_mapped": 44.3245723969125,
mapped/salmon_AG1232_012_FR3/aux_info/meta_info.json: "percent_mapped": 43.92756434947104,
mapped/salmon_AG1232_013_FI1/aux_info/meta_info.json: "percent_mapped": 43.926584646299769,
mapped/salmon_AG1232_014_FI2/aux_info/meta_info.json: "percent_mapped": 47.39591196351944,
mapped/salmon_AG1232_015_FI3/aux_info/meta_info.json: "percent_mapped": 47.30261633277744,
mapped/salmon_AG1232_016_FG1/aux_info/meta_info.json: "percent_mapped": 43.84030781618126,
mapped/salmon_AG1232_017_FG2/aux_info/meta_info.json: "percent_mapped": 47.29698385472076,
mapped/salmon_AG1232_018_FG3/aux_info/meta_info.json: "percent_mapped": 44.262969929840519,
Even when allowing dovetailing, and reducing thresholds, mapping rates don't exceed 65%:
$ grep percent mapped/salmon_MS*/aux_info/meta*.json
mapped/salmon_MS0.33_AG1232_001_MR1/aux_info/meta_info.json: "percent_mapped": 42.6625043194414,
mapped/salmon_MS0.33_AG1232_002_MR2/aux_info/meta_info.json: "percent_mapped": 62.12927768954744,
mapped/salmon_MS0.33_AG1232_003_MR3/aux_info/meta_info.json: "percent_mapped": 63.4821479739517,
mapped/salmon_MS0.33_AG1232_004_MI1/aux_info/meta_info.json: "percent_mapped": 58.058107055773877,
mapped/salmon_MS0.33_AG1232_005_MI2/aux_info/meta_info.json: "percent_mapped": 55.60254094895548,
mapped/salmon_MS0.33_AG1232_006_MI3/aux_info/meta_info.json: "percent_mapped": 48.71359834692358,
mapped/salmon_MS0.33_AG1232_007_MG1/aux_info/meta_info.json: "percent_mapped": 49.90792451225673,
mapped/salmon_MS0.33_AG1232_008_MG2/aux_info/meta_info.json: "percent_mapped": 52.95395875924016,
mapped/salmon_MS0.33_AG1232_009_MG3/aux_info/meta_info.json: "percent_mapped": 44.603004397116929,
mapped/salmon_MS0.33_AG1232_010_FR1/aux_info/meta_info.json: "percent_mapped": 49.8363286696079,
mapped/salmon_MS0.33_AG1232_011_FR2/aux_info/meta_info.json: "percent_mapped": 52.2987149497359,
mapped/salmon_MS0.33_AG1232_012_FR3/aux_info/meta_info.json: "percent_mapped": 50.73178518208944,
mapped/salmon_MS0.33_AG1232_013_FI1/aux_info/meta_info.json: "percent_mapped": 52.302065008945408,
mapped/salmon_MS0.33_AG1232_014_FI2/aux_info/meta_info.json: "percent_mapped": 55.82249021745959,
mapped/salmon_MS0.33_AG1232_015_FI3/aux_info/meta_info.json: "percent_mapped": 55.80153947588767,
mapped/salmon_MS0.33_AG1232_016_FG1/aux_info/meta_info.json: "percent_mapped": 49.49543448190936,
mapped/salmon_MS0.33_AG1232_017_FG2/aux_info/meta_info.json: "percent_mapped": 55.19039678416574,
mapped/salmon_MS0.33_AG1232_018_FG3/aux_info/meta_info.json: "percent_mapped": 50.730150343757518,
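For reference, the range across samples can be summarised with a small helper. This is a minimal sketch: `percent_range` is a hypothetical function name, and it assumes each `meta_info.json` contains a single `"percent_mapped": <number>,` line, as in the grep output above.

```shell
# percent_range FILE...: print the lowest and highest "percent_mapped" value
# found in the given meta_info.json files, plus the number of files matched.
# Assumption: each file has one line of the form  "percent_mapped": 42.66,
percent_range() {
    grep -h '"percent_mapped"' "$@" |
        awk -F': *' '{ v = $2 + 0                      # numeric prefix of the value
                       if (NR == 1 || v < min) min = v
                       if (NR == 1 || v > max) max = v }
                     END { if (NR) printf "min=%.2f max=%.2f n=%d\n", min, max, NR }'
}

# Usage (paths as in the commands above):
#   percent_range mapped/salmon_AG*/aux_info/meta_info.json
#   percent_range mapped/salmon_MS*/aux_info/meta_info.json
```

Parsing with grep and awk keeps the helper free of extra dependencies; a JSON-aware tool would be more robust if the file layout ever changes.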
The readthedocs link for Salmon suggests, "Most of the information recorded in this file should be self-descriptive", but this is not the case for me.
To Reproduce
Steps and data to reproduce the behavior:
- Map reads
- Observe mapping rates consistently below 80%
Specifically, please provide at least the following information:
- Which version of salmon was used? salmon (selective-alignment-based) v1.10.2
- How was salmon installed (compiled, downloaded executable, through bioconda)? package installation via Debian
- Which reference (e.g. transcriptome) was used? transcriptome
- Which read files were used? **NovaSeq X Plus; read length 150bp x 2; untrimmed**
- Which program options were used?
for sampleName in $(ls -d ag1232/AG* | perl -pe 's/^ag1232.//'); do
salmon quant -p 12 --index reference/salmon_index -l ISR -1 ag1232/${sampleName}/*_1.fq.gz -2 ag1232/${sampleName}/*_2.fq.gz \
--validateMappings --seqBias --gcBias --posBias --numBootstraps 10 --writeUnmappedNames -o mapped/salmon_${sampleName};
done
Making Salmon less stringent:
for sampleName in $(ls -d ag1232/AG* | perl -pe 's/^ag1232.//'); do
salmon quant -p 12 --index reference/salmon_index -l ISR -1 ag1232/${sampleName}/*_1.fq.gz -2 ag1232/${sampleName}/*_2.fq.gz \
--validateMappings --seqBias --gcBias --posBias --softclip --allowDovetail --minScoreFraction 0.33 --recoverOrphans \
--numBootstraps 10 --writeUnmappedNames -o mapped/salmon_MS0.33_${sampleName};
done
Expected behavior
Documentation that has a good explanation of the parameters in the metadata file, sufficient to explain why Salmon mapping rates are different from other programs, and why it's common to see mapping rates below 80% (e.g. here, where a mapping rate of 63% is apparently acceptable).
Consider the following statistics (in meta_info.json):
$ grep -A 7 'num_processed' mapped/salmon_MS*001_*/aux_info/meta*.json
"num_processed": 39191989,
"num_mapped": 16720284,
"num_decoy_fragments": 3376529,
"num_dovetail_fragments": 5188759,
"num_fragments_filtered_vm": 3487789,
"num_alignments_below_threshold_for_mapped_fragments_vm": 3046512,
"percent_mapped": 42.6625043194414,
"call": "quant",
The numbers from unmapped_names.txt are as follows:
$ awk '{print $2}' mapped/salmon_MS0.33_AG1232_001_MR1/aux_info/unmapped_names.txt | sort | uniq -c
3376529 d
495372 m1
469890 m2
18903893 u
I can see that the percent_mapped statistic matches the num_mapped as a proportion of num_processed, but the remaining numbers don't make up the remainder, and (in any case) I had tried to allow dovetail fragments in the command line options. In other words, these numbers don't appear to properly categorise the unmapped reads:
- Unmapped reads = 39191989 - 16720284 = 22471705
- Decoy [+ dovetail] + filtered_vm + below_threshold = 3376529 + 5188759 + 3487789 + 3046512 = 15099589
... leaving 7372116 reads unaccounted for. And based on the explanations given in other mapping issue reports, some of those counters may include several alignments for a single read, meaning the unaccounted-for number is probably higher:
The number you are looking at is the number of discarded mappings, not the number of discarded fragments. The difference is that every fragment can have many potential mappings. The number you are looking at is the total number of attempted alignments that failed to achieve the threshold score. Luckily, salmon reports both numbers. The number of fragments for which all alignments failed to reach the score threshold is 4,196,417; given in aux_info.json by "num_fragments_filtered_vm": 4196417. One point to note is that these are all fragments for which mapping is attempted (they had at least one k-mer match the reference), but no alignment was valid up to the threshold. You could try running the quantification again with --softclip to allow softclipping of the reads and see if any considerable fraction of these 4196417 failed to align because they overhang the annotated transcripts or contain adapters etc. Nonetheless, even if all of these mapped, the rate would still be ~72%. The remainder of the reads didn't even have a matching k-mer in common with the reference transcriptome, which means they are exceedingly unlikely to have come from the transcripts that were indexed.
Further explanation of what these metadata numbers mean would be very helpful to me.
Also useful would be a statistic (or more than one statistic) that fully categorises the read alignments or non-alignments.
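The bookkeeping above can be sketched as a small shell helper. This is only an illustration under a loud assumption: it treats the four counters as disjoint per-fragment categories, which the quoted explanation suggests is not true for at least `num_alignments_below_threshold_for_mapped_fragments_vm`. The function names `get_count` and `tally_unmapped` are hypothetical, and the demo values are copied from the `meta_info.json` excerpt above.

```shell
# get_count FILE KEY: print the integer stored under "KEY" in FILE.
get_count() {
    grep -o "\"$2\": *[0-9][0-9]*" "$1" | head -n 1 | grep -o '[0-9][0-9]*$'
}

# tally_unmapped FILE: print the unmapped count, the sum of the four
# counters, and the difference between the two.
# ASSUMPTION: the counters are summed as if they were disjoint fragment
# categories; salmon's documentation does not confirm this.
tally_unmapped() {
    processed=$(get_count "$1" num_processed)
    mapped=$(get_count "$1" num_mapped)
    decoy=$(get_count "$1" num_decoy_fragments)
    dovetail=$(get_count "$1" num_dovetail_fragments)
    filtered=$(get_count "$1" num_fragments_filtered_vm)
    below=$(get_count "$1" num_alignments_below_threshold_for_mapped_fragments_vm)
    unmapped=$(( processed - mapped ))
    categorised=$(( decoy + dovetail + filtered + below ))
    echo "unmapped=$unmapped categorised=$categorised unaccounted=$(( unmapped - categorised ))"
}

# Demo with the counters reported above for salmon_MS0.33_AG1232_001_MR1.
demo=$(mktemp)
cat > "$demo" <<'EOF'
{
    "num_processed": 39191989,
    "num_mapped": 16720284,
    "num_decoy_fragments": 3376529,
    "num_dovetail_fragments": 5188759,
    "num_fragments_filtered_vm": 3487789,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 3046512,
    "percent_mapped": 42.6625043194414
}
EOF
tally_unmapped "$demo"
# prints: unmapped=22471705 categorised=15099589 unaccounted=7372116
rm -f "$demo"
```

A documented statistic along these lines, with categories that are guaranteed to be disjoint and to sum to `num_processed`, is what would make the metadata self-descriptive.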
Desktop (please complete the following information):
- OS: Debian
uname -a: Linux musculus 6.7.9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.7.9-2 (2024-03-13) x86_64 GNU/Linux
lsb_release -a:
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux trixie/sid
Release: n/a
Codename: trixie
Additional context
I'm not really after an explanation of why read mapping rates are low in my specific case; I'm after an explanation in the documentation of why read mapping rates from Salmon are generally low.
Update: I've just confirmed that trimming doesn't have any substantial impact on the results for our first sample [AG1232_001_MR1].