
Strand Bias ~ 0.96 and low mapping rate

Open GianlucaMattei opened this issue 3 years ago • 4 comments

Hello, I am using salmon for an RNA-seq experiment, but I am getting a very low mapping rate (0.03%!) and a strong strand bias (0.96). This has never happened before and I cannot work out the cause. I have already checked the read quality with FastQC and it is good. The same thing happens with all the other samples (N=6) that the sequencing service sent us for this RNA-seq experiment. At the moment I am afraid the service made a mistake. Or could this problem be related to the library type?

This is the command I ran:

salmon quant -i /somewhere/in/the/server/index/salmon -l A -1 /somewhere/in/the/server/fastq/sample1_1.fq.gz -2 /somewhere/in/the/server/fastq/sample1_2.fq.gz -p 10 --seqBias --gcBias --validateMapping -o /somewhere/in/the/server/salmon

lib_format_counts.json:

{
    "read_files": [
        "/somewhere/in/the/server/fastq/sample1_1.fq.gz",
        "/somewhere/in/the/server/fastq/sample1_2.fq.gz"
    ],
    "expected_format": "IU",
    "compatible_fragment_ratio": 1.0,
    "num_compatible_fragments": 6520,
    "num_assigned_fragments": 6520,
    "num_frags_with_concordant_consistent_mappings": 2126,
    "num_frags_with_inconsistent_or_orphan_mappings": 4704,
    "strand_mapping_bias": 0.964722483537159,
    "MSF": 0,
    "OSF": 0,
    "ISF": 75,
    "MSR": 0,
    "OSR": 0,
    "ISR": 2051,
    "SF": 1436,
    "SR": 3268,
    "MU": 0,
    "OU": 0,
    "IU": 0,
    "U": 0
}
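A quick arithmetic check on the counts above, as a minimal Python sketch. It only uses numbers copied from this lib_format_counts.json. On this data, the share of ISR among the concordant pairs (ISF + ISR) reproduces the reported strand_mapping_bias of about 0.9647. This is an observation about these numbers, not a statement of how salmon computes the field internally.

```python
# Counts copied from the lib_format_counts.json in this post.
counts = {"ISF": 75, "ISR": 2051, "SF": 1436, "SR": 3268}

# Concordant paired mappings: inward orientation, forward or reverse strand.
concordant = counts["ISF"] + counts["ISR"]
isr_fraction = counts["ISR"] / concordant

# Mappings where only one end was placed, split by strand.
orphans = counts["SF"] + counts["SR"]
sr_fraction = counts["SR"] / orphans

print(f"concordant pairs: {concordant}")
print(f"ISR share of concordant pairs: {isr_fraction:.4f}")
print(f"orphan mappings: {orphans}")
print(f"SR share of orphan mappings: {sr_fraction:.4f}")
```

The two sums match num_frags_with_concordant_consistent_mappings (2126) and num_frags_with_inconsistent_or_orphan_mappings (4704), and the concordant pairs are dominated by ISR, which is what a reverse-stranded paired-end library would look like.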

A few lines from salmon_quant.log:

Only 6520 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.

Mapping rate = 0.0301431%
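For scale, the two log values above imply the approximate number of fragments salmon processed. A small Python sketch, assuming the reported mapping rate is simply mapped fragments divided by processed fragments:

```python
# Values copied from the salmon_quant.log excerpt in this post.
mapped_fragments = 6520
mapping_rate_percent = 0.0301431

# Assumption: mapping rate = mapped / processed * 100.
total_fragments = mapped_fragments / (mapping_rate_percent / 100)

print(f"approximate fragments processed: {total_fragments:,.0f}")
```

Under that assumption the library contains on the order of 21.6 million fragments, of which only a few thousand mapped.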

GianlucaMattei avatar Jun 23 '21 09:06 GianlucaMattei

I have the same issue :(

kate-simonova avatar Feb 11 '22 15:02 kate-simonova

Hi @kate-simonova,

How low is your mapping rate? Can you explicitly pass the flag -l IU for the library type? Note, however, that a strand bias this strong suggests the data are stranded. Also, can you report how the mapping rate changes if you add --softclip and/or lower --minScoreFraction? If the mapping rate is very low, this could mean that the sample simply does not match the reference well. Could you post the contents of meta_info.json?
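One way to organize that comparison is to generate the command variants side by side. The following Python sketch only builds and prints the command lines; it does not run salmon. The paths are the placeholders from the first post, the output directory names and the value 0.5 for --minScoreFraction are hypothetical, and whether --softclip is accepted depends on the installed salmon version.

```python
import shlex

# Placeholder paths taken from the first post; adjust to your own data.
base = [
    "salmon", "quant",
    "-i", "/somewhere/in/the/server/index/salmon",
    "-l", "IU",
    "-1", "/somewhere/in/the/server/fastq/sample1_1.fq.gz",
    "-2", "/somewhere/in/the/server/fastq/sample1_2.fq.gz",
    "-p", "10",
    "--validateMappings",
]

# Extra flags per variant. The 0.5 threshold is an illustrative value only.
variants = {
    "baseline": [],
    "softclip": ["--softclip"],
    "lower_min_score": ["--minScoreFraction", "0.5"],
}

# Hypothetical output directories, one per variant, so results can be compared.
commands = {
    name: base + extra + ["-o", f"salmon_out_{name}"]
    for name, extra in variants.items()
}

for name, cmd in commands.items():
    print(f"# {name}")
    print(shlex.join(cmd))
```

Running each printed command and comparing percent_mapped in the resulting meta_info.json files shows whether either option changes the mapping rate.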

Best, Rob

rob-p avatar Feb 11 '22 15:02 rob-p

Dear Rob,

Thank you for the fast response.

I just found out that I am not using the latest version of salmon. Could that be the issue?

{
    "salmon_version": "0.13.1",
    "samp_type": "none",
    "opt_type": "vb",
    "quant_errors": [],
    "num_libraries": 1,
    "library_types": [
        "IU"
    ],
    "frag_dist_length": 1001,
    "seq_bias_correct": false,
    "gc_bias_correct": false,
    "num_bias_bins": 4096,
    "mapping_type": "mapping",
    "num_targets": 86774,
    "serialized_eq_classes": false,
    "eq_class_properties": [],
    "length_classes": [
        553,
        826,
        1654,
        3040,
        100228
    ],
    "index_seq_hash": "8265e19233c976854a2856310b12551f166fa5b44ab1f0d83a36230b7aaa7b75",
    "index_name_hash": "8c117139f22fe26df5c3c865ff1241308ebd75d80caa003ab84fc40c59536e2f",
    "index_seq_hash512": "fcc6dfb6bc12ab97bc313035978f58983cef3af6cafa51b44825da1f950808a55ba28c92264d85668527279c59ab7f918eb3b87b12231cb85006371a73308b83",
    "index_name_hash512": "0139064e5c9c6029b2cd4b44c6c57a304c583e2ac58b00f498f5b22f6828b4ca41fcb80e1ea75cfba636d3f07fd81ecf16b1a0e26d367b231b9cb8164cd83548",
    "num_bootstraps": 0,
    "num_processed": 41427244,
    "num_mapped": 3369623,
    "num_dovetail_fragments": 1535077,
    "num_fragments_filtered_vm": 2021515,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 1387081,
    "percent_mapped": 8.133833377861196,
    "call": "quant",
    "start_time": "Fri Feb 11 18:53:28 2022",
    "end_time": "Fri Feb 11 19:22:13 2022"
}
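To put the counters in this meta_info.json on a common scale, here is a short Python sketch that expresses each one as a percentage of num_processed. The numbers are copied from the JSON above. The sketch makes no claim about what each counter means beyond its name.

```python
# Counters copied from the meta_info.json in this post.
meta = {
    "num_processed": 41427244,
    "num_mapped": 3369623,
    "num_dovetail_fragments": 1535077,
    "num_fragments_filtered_vm": 2021515,
}


def percent_of_processed(count: int) -> float:
    """Express a counter as a percentage of all processed fragments."""
    return 100.0 * count / meta["num_processed"]


percent_mapped = percent_of_processed(meta["num_mapped"])
percent_dovetail = percent_of_processed(meta["num_dovetail_fragments"])
percent_filtered = percent_of_processed(meta["num_fragments_filtered_vm"])

print(f"mapped:      {percent_mapped:.2f}%")
print(f"dovetail:    {percent_dovetail:.2f}%")
print(f"filtered_vm: {percent_filtered:.2f}%")
```

The mapped percentage reproduces the reported percent_mapped of about 8.13%, and the other two counters are each only a few percent of the processed fragments.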

Yes, the mapping rate is low, even though Fastqscreen shows that over 75% of the reads map to the mouse genome. I first tried mapping to the latest transcriptome version (Mus musculus GRCm39), then I specifically used the Ensembl mus_musculus_c57bl6nj transcriptome. I got the same result. I can also attach the FastQC and Fastqscreen results if needed.

Thank you.

Ekaterina

kate-simonova avatar Feb 11 '22 18:02 kate-simonova

Hi @kate-simonova,

While I would certainly recommend updating to the latest version of salmon (which, given the pre-1.0.0 to post-1.0.0 difference, would require you to rebuild the index), I don't think that would have a substantial effect on a mapping rate that is this low.

If the Fastqscreen report suggests that most of the reads map to the genome (>75%), but you are seeing an 8% mapping rate in salmon, this strongly suggests that most of the reads are, for some reason, arising from outside annotated genes. I would then have two suggestions to test out:

1.) Check for rRNA contamination. Try adding extra ribosomal RNA sequences to your reference fasta, re-indexing, and re-quantifying. If rRNA depletion or polyA enrichment failed, then it's possible that most of your RNA-seq reads come from rRNA genes. I've seen this a number of times, and it results in a situation where most of the reads map back to the genome but not to the annotated transcriptome, which often has an incomplete set of rRNA sequences.

2.) Try mapping the reads to the genome and see how many reads overlap known genes. This is what you would do with a "counting-based" RNA-seq pipeline, so something like STAR+featureCounts or subread+featureCounts. While I would generally not recommend this for quantification, it can be instructive to see the fraction of reads that map to the genome but not to known transcripts. Likewise, you could (with the newest salmon) build an index on the transcriptome with the genome added as a decoy (see our documentation on decoy-aware indexing). The meta_info.json will then tell you the fraction of reads that were discarded because they were best matched to a decoy sequence (in this case, the genome rather than some annotated transcript).

This should help clarify what's going on, and might suggest some issues with the sample that are preventing a reasonable mapping rate to the annotated transcriptome.
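Reading the decoy fraction back out of a run could look like the following Python sketch. It assumes that a decoy-aware run writes a num_decoy_fragments field to aux_info/meta_info.json in the output directory. That field name is an assumption here and is not shown anywhere in this thread, so the code treats it as optional.

```python
import json
from pathlib import Path


def load_meta(quant_dir: str) -> dict:
    """Load meta_info.json from a salmon output directory."""
    return json.loads((Path(quant_dir) / "aux_info" / "meta_info.json").read_text())


def mapping_summary(meta: dict) -> dict:
    """Summarize mapped and (if present) decoy fragments as percentages."""
    processed = meta["num_processed"]
    summary = {"percent_mapped": 100.0 * meta["num_mapped"] / processed}
    # Assumed field name; absent in runs against an index without decoys.
    decoy = meta.get("num_decoy_fragments")
    if decoy is not None:
        summary["percent_decoy"] = 100.0 * decoy / processed
    return summary


# Usage, with a hypothetical output directory:
# print(mapping_summary(load_meta("salmon_out_decoy")))
```

A large percent_decoy next to a small percent_mapped would support the explanation that the reads come from the genome but not from annotated transcripts.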

Best, Rob

rob-p avatar Feb 11 '22 19:02 rob-p