salmon
salmon copied to clipboard
automatic library type detection infers correct orientation but still includes other reads for quantification
The result of salmon quant
(quasi-mapping + SA) for the same paired-end read input differs when run with --libType A
versus --libType ISF
, even though Salmon correctly infers that the library orientation is ISF when run with --libType A
.
In particular, the sum of the NumReads column from quant.sf
is greater for the --libType A
result than for the --libType ISF
result. Upon inspection, I discovered that this is because even when --libType A
correctly infers that the orientation is ISF, it still uses the ~4500 reads (< 1%) that were in orientation ISR.
I interpreted the --libType A
flag to mean that salmon
automatically infers the library orientation to be type X, then that means it will only use reads that are compatible with library orientation X in the quantification. Was that the intention of --libType A
?
The two commands being compared:
salmon quant --libType A --seqBias --gcBias --discardOrphansQuasi --validateMappings -1 mate1.fq -2 mate2.fq -g gene_map.tsv -p 8 -o /path/to/out -i /path/to/index
versus
salmon quant --libType ISF --seqBias --gcBias --discardOrphansQuasi --validateMappings -1 mate1.fq -2 mate2.fq -g gene_map.tsv -p 8 -o /path/to/out -i /path/to/index
.
They are exactly the same except for the value provided to --libType
.
Relevant results when run with --libType A
$ datamash -H sum 5 < quant.sf
> 917202.01
$ cat lib_format_counts.json
{ "read_files": "[ /path/to/mate1.fastq.gz, /path/to/mate2.fastq.gz]", "expected_format": "ISF", "compatible_fragment_ratio": 1.0, > "num_compatible_fragments": 917202, "num_assigned_fragments": 917202, > "num_frags_with_concordant_consistent_mappings": 912759, > "num_frags_with_inconsistent_or_orphan_mappings": 4557, "strand_mapping_bias": 0.995032246248839, "MSF": 0, "OSF": 0, > "ISF": 912759, "MSR": 0, "OSR": 0, > "ISR": 4557, "SF": 0, "SR": 0, "MU": 0, "OU": 0, "IU": 0, "U": 0 }
Relevant results when run with --libType ISF
$ datamash -H sum 5 < quant.sf
> 912758.978
$ cat lib_format_counts.json
{ "read_files": "[ /path/to/mate1.fastq.gz, /path/to/mate2.fastq.gz]", "expected_format": "ISF", "compatible_fragment_ratio": 1.0, > "num_compatible_fragments": 912759, "num_assigned_fragments": 912759, > "num_frags_with_concordant_consistent_mappings": 912759, "num_frags_with_inconsistent_or_orphan_mappings": 0, "strand_mapping_bias": 1.0, "MSF": 0, "OSF": 0, > "ISF": 912759, "MSR": 0, "OSR": 0, "ISR": 0, "SF": 0, "SR": 0, "MU": 0, "OU": 0, "IU": 0, "U": 0 }
Hi, I'm having a similar issue with specification of library type. I'm quantifying a single-end library of type SF with salmon 1.3.0. The two commands being compared are:
auto detect:
salmon --no-version-check quant --writeMappings -i target --incompatPrior 0.0 --validateMappings -l A -r d218056_dedup.fastq -p 4 -o d218056_A.quant > d218056_A.sam
specify SF:
salmon --no-version-check quant --writeMappings -i target --incompatPrior 0.0 --validateMappings -l SF -r d218056_dedup.fastq -p 4 -o d218056_SF.quant > d218056_SF.sam
The log files indicate that salmon correctly identifies the library as SF in the auto case. I noticed the issue when examining a pair of genes with overlapping 3'UTRs. The forward strand gene (GQ67_03478) is expressed at a much lower level than the reverse strand gene (GQ67_03479). The sam files contain the same number of reads mapped to each transcript without regard to how the libtype is specified:
egrep -v '^@' d218056_A.sam|grep -c GQ67_03478
382
egrep -v '^@' d218056_A.sam|grep -c GQ67_03479
399
egrep -v '^@' d218056_SF.sam|grep -c GQ67_03478
382
egrep -v '^@' d218056_SF.sam|grep -c GQ67_03479
399
The quantitation is very different with 120 counts assigned to the forward strand gene in the A case and a much more accurate (based on examination of the sam file) 10 counts in the SF case:
grep GQ67_03478 d218056_A.quant/quant.sf GQ67_03478T0 2914 2664.000 202.831978 119.926
grep GQ67_03478 d218056_SF.quant/quant.sf
GQ67_03478T0 2914 2664.000 17.066270 10.000
For the reverse strand gene, the auto case undercounts due to reads being assigned to the forward strand gene.
grep GQ67_03479 d218056_A.quant/quant.sf
GQ67_03479T0 1383 1133.000 1245.013842 313.074
grep GQ67_03479 d218056_SF.quant/quant.sf
GQ67_03479T0 1383 1133.000 1589.051981 396.000
I've been using salmon with -l A thinking that if the software correctly recognizes the libtype, the results would be nearly identical to explicitly specifying the libtype but that does not seem to be true.
Is this expected behavior?
thanks
Charlie
Hi @bumproo and @mmp3,
Thanks for raising this issue. So, I am surprised at these particular differences that you found, but the behavior you are observing is consistent with how automatic library type detection works. Let me explain what it's doing, and then I'm open to discussing if we should focus on changing that behavior going forward.
The standard library type detection works by looking at the total number of compatible mappings in both possible orientations for the first x=10,000 aligned reads. These 10,000 reads are themselves mapped with an IU
orientation in paired data and a U
orientation in unpaired data. Once the 10,000 data points have been processed, a heuristic chooses the most likely library type and applies it (and salmon issues a warning if, at the end of quantification,, there are too many reads that disagree). So, the explanation of what could be happening here is that the reads that are different between your runs are coming within the first set of 10,000 aligned reads (note, this may not be the first 10k reads of the file, because concurrent processing means that reads from different threads are being aligned in an essentially random order ... but of course maintaining pairing information).
The argument for why this should usually not cause a considerable difference is because (1) the reads are assumed to arrive in a random order (2) this is only a very small fraction of the total data and (3) if the unoriented reads map to a target in the "wrong" direction, they will also align to the proper target in the "right" direction and, once the orientation filter is applied for future reads, the weight of evidence should turn the probability of assignment of these unoriented reads toward the proper target. However, I'm guessing there is an edge case you're seeing here where the conditions don't induce this behavior.
So, there are 2 immediate solutions to the problem. First, if you know the library type explicitly, you can use that. Second (and some other folks have discussed this here for other reasons), you can do a "throw-away" run of salmon on a small prefix of the read file (e.g. salmon quant ... -lA --skipQuant -r <(gunzip -c reads.fq.gz | head -n 400000)
) to get the output of the automatic library type determination, and then run the full dataset with that library type.
Finally, moving forward, I'm happy to consider working on modifying this default behavior. That is, we could (though it would be a little bit of work) modify the default behavior. The idea here is to basically run as we do now for the first 10,000 aligned reads to get the library type and then "reset" the whole quantification pipeline. The main challenge here is that salmon is designed to work with streaming FASTQ input, and we don't want to break that. So we can't do something as easy as "reset the file pointer". I think the best option is to make a copy of the first X reads in memory, detection the library type with them, and then start quantifying them and continue with the rest of the file. That complicates the logic a bit, because now the input source for reads changes dynamically during quantification --- but I think it could be done. Please let me know if you both have interest in this feature and it's worth putting on the list.
Best, Rob
Hi Rob, Thank you very much for the quick response. It is very helpful to understand the behavior of the "-l A" call better and the use the throw-away run to verify the library type so that it can be explicitly set is an excellent solution for me and a change to the software is probably not critical thanks again, both for this answer and the excellent software Charlie
Thank you for the clear and thorough explanation, @rob-p . Now I understand exactly why this is happening.
I like your idea for the “throw-away” run for Salmon, and the short example command you sketched out is exactly what I had said in mind as I read your words.
Reworking the core Salmon algorithm to do some gymnastics with re-processing the first 10,000 reads would not be elegant or worth your time. I think the workaround you proposed is a perfectly good solution.
If in the long run many other people find this useful, perhaps an easier fix would be to make a new command in Salmon that just bails after the first 10k reads automatically and returns the detected library orientation upon termination of the command; e.g. in Bash:
mylibtype=$(salmon quant —getLibType -r reads.fq.gz)
salmon quant —libType $mylibtype -r reads.fq.gz
Thank you for the great software and for being so attentive to detail and our questions.