
Salmon Library Types

Open Kisekya opened this issue 3 years ago • 4 comments

Hello to all,

I have a question about salmon, and more precisely about the alignment part. I run salmon on BAM files of RNA-seq data, and the problem is that I don't really know what to use as <LIBTYPE>.

I used -l A to let salmon detect the library type automatically, but the "meta_info.json" it gives me is:

     "salmon_version": "1.4.0",
        "samp_type": "none",
        "opt_type": "vb",
        "quant_errors": [],
        "num_libraries": 1,
        "library_types": [
            "MU"
        ],

"MU" but what puzzles me is that my datas are IlluminaTruSeq Stranded. "MU" is used for unstranded then it's a little weird....

I tested every library type, and MU gives me the best mapping percentage, about 60%; the others (ISF, IU, ISR) give at most 10-20%.

Is there a problem with using MU on a stranded library, given that salmon chose it automatically? Especially since this is what gives me the best results...
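
For context, my salmon alignment-mode call looks roughly like this (paths, thread counts and output names are placeholders, not my exact pipeline):

    # alignment-based salmon on the BAM; -l A asks salmon to infer the library type
    # (transcripts.fa, aln.bam and the output directories are placeholders)
    salmon quant -t transcripts.fa -l A -a aln.bam -p 8 -o salmon_A

    # to test a specific library type instead, replace -l A, e.g.
    salmon quant -t transcripts.fa -l ISR -a aln.bam -p 8 -o salmon_ISR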

Thanks for your help,

Kisekya

Kisekya avatar Jun 22 '21 11:06 Kisekya

Hi @Kisekya,

I wonder if you could say something about how the BAM file itself was prepared. The MU library type is quite exotic and certainly not expected for a standard Illumina sample (especially a TruSeq Stranded library), so I suspect the way the alignments were generated has something to do with it.

Best, Rob

rob-p avatar Jun 22 '21 14:06 rob-p

Hi @rob-p,

Thank you for your reply. It's a BWA-MEM2 alignment with this command:

 "bwa-mem2 mem -M -t {threads} -v 2 {input.reference} {input.reads} | samtools view -q 20 -F 3844 --threads {threads} -Sb -> {output.inter_bam} && "
 "samtools sort -@ {threads} -O bam {output.inter_bam} > {output.final_bam} && samtools index -@ {threads} {output.final_bam} && samtools flagstat {output.final_bam} > {output.flag} "

I don't know if it's because I use BWA, which is a non-splicing aligner, or because I sorted my BAM file. I am developing a pipeline, and the first step is to test the data with bwa-mem2 and salmon.

Is it really a problem to use the results obtained with the MU library type? Could the 60% be misleading, counting alignments that shouldn't be accepted while ignoring good alignments, because of the wrong library type?

Edit

I also tried doing the mapping and quantification directly with salmon:


{
    "read_files": "[ ../results/trimmed/3373-1_CCGCGGTT-CTAGCGCT-AHV5HLDSXY_L004_R1_trimmed.fastq.gz, ../results/trimmed/3373-1_CCGCGGTT-CTAGCGCT-AHV5HLDSXY_L004_R2_trimmed.fastq.gz]",
    "expected_format": "ISR",
    "compatible_fragment_ratio": 1.0,
    "num_compatible_fragments": 35843765,
    "num_assigned_fragments": 35843765,
    "num_frags_with_concordant_consistent_mappings": 29709658,
    "num_frags_with_inconsistent_or_orphan_mappings": 6209768,
    "strand_mapping_bias": 0.000008381042727599249,
    "MSF": 0,
    "OSF": 0,
    "ISF": 249,
    "MSR": 0,
    "OSR": 0,
    "ISR": 29709658,
    "SF": 2520360,
    "SR": 3689159,
    "MU": 0,
    "OU": 0,
    "IU": 0,
    "U": 0
}

It detects the ISR library type, but I have only 40% mapping; it's really confusing...

Best, Kisekya

Kisekya avatar Jun 22 '21 17:06 Kisekya

Hi @Kisekya,

So BWA-MEM and BWA-MEM2 are somewhat of a problem to begin with because they perform local alignment, which isn't really ideal for aligning RNA-seq reads to the transcriptome. If you really wish to use an aligner, we've had good experiences with Bowtie2 (when used in the appropriate end-to-end alignment mode) and with STAR (using the --quantMode TranscriptomeSAM flag to output the alignments projected to transcriptomic coordinates, as required by salmon).
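
For the STAR route, the relevant part is roughly the following (index path, read files and thread count are placeholders):

    # align with STAR and also emit alignments projected to transcript coordinates
    STAR --runThreadN 8 \
         --genomeDir star_index \
         --readFilesIn reads_1.fastq.gz reads_2.fastq.gz \
         --readFilesCommand zcat \
         --outSAMtype BAM Unsorted \
         --quantMode TranscriptomeSAM \
         --outFileNamePrefix sample_
    # sample_Aligned.toTranscriptome.out.bam is what you would then pass to
    # salmon in alignment mode (salmon quant -a ...), without sorting it.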

Apart from the local alignment issue, sorting the BAM file is absolutely a problem for salmon, and is likely why you get the strange library type. When run in alignment mode, just like RSEM, salmon requires the alignments for the mates of a read pair to appear consecutively in the file, and all alignments for a given read to appear contiguously in the file. This allows the reads to be parsed without requiring potentially unbounded memory (holding the record for one end of a fragment in memory while waiting for the record for the other end). In fact, given that you've sorted the alignments here, I'm surprised you're not getting "suspicious pair" warnings in your logs.
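
If the unsorted aligner output is no longer available, re-grouping the coordinate-sorted BAM by read name restores the mate adjacency salmon needs; a sketch (file names and thread count are placeholders, and keeping the aligner's unsorted output is still the better option):

    # put the mates of each pair back next to each other by sorting on read name
    samtools sort -n -@ 8 -o namesorted.bam coordsorted.bam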

The ISR library with 40% mapping is likely a more reliable number. The obvious question here is why might the mapping rate be this low? There are a few reasons you might see something like this. One, for example, is poor ribosomal depletion, paired with not having all of the rRNA sequences in your index. In this case, you have many fewer reads coming from the rest of the transcriptome and you get depleted mapping rates like this.

Could you say a bit more about the experimental setup? Is this in a well-annotated organism like human / mouse etc.? Is this a polyA selection or ribosomal depletion prep? Anything else that might be relevant to sample quality?

Best, Rob

rob-p avatar Jul 04 '21 04:07 rob-p

Hi @rob-p, I am aware of that, but we had gone with bwa anyway. I decided to follow your advice and used STAR with this command:

"STAR --runThreadN {threads} --runMode alignReads --genomeDir {input.ref} --readFilesIn {input.fq1} {input.fq2} --readFilesCommand zcat --outSAMtype BAM Unsorted SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts --outFileNamePrefix {output} --outStd Log {log} "

I then got this final.out:


                                 Started job on |	Jul 05 07:51:09
                             Started mapping on |	Jul 05 07:51:13
                                    Finished on |	Jul 05 10:01:38
       Mapping speed, Million of reads per hour |	39.36

                          Number of input reads |	85547657
                      Average input read length |	298
                                    UNIQUE READS:
                   Uniquely mapped reads number |	36980651
                        Uniquely mapped reads % |	43.23%
                          Average mapped length |	283.47
                       Number of splices: Total |	943061
            Number of splices: Annotated (sjdb) |	0
                       Number of splices: GT/AG |	411198
                       Number of splices: GC/AG |	39101
                       Number of splices: AT/AC |	13983
               Number of splices: Non-canonical |	478779
                      Mismatch rate per base, % |	0.56%
                         Deletion rate per base |	0.03%
                        Deletion average length |	4.89
                        Insertion rate per base |	0.03%
                       Insertion average length |	4.88
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	1029261
             % of reads mapped to multiple loci |	1.20%
        Number of reads mapped to too many loci |	565
             % of reads mapped to too many loci |	0.00%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |	0
       % of reads unmapped: too many mismatches |	0.00%
            Number of reads unmapped: too short |	47533174
                 % of reads unmapped: too short |	55.56%
                Number of reads unmapped: other |	4006
                     % of reads unmapped: other |	0.00%
                                  CHIMERIC READS:
                       Number of chimeric reads |	0
                            % of chimeric reads |	0.00%

I filtered it with samtools view -f 2 -F 3840, and salmon gave me this result, which is still very weak: 24323638 counts
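
Spelled out, that filter step looks roughly like this (file names are placeholders, not my exact paths):

    # keep only properly paired reads (-f 2) and drop secondary, QC-fail,
    # duplicate and supplementary alignments (-F 3840 = 256+512+1024+2048)
    samtools view -b -f 2 -F 3840 -o filtered.bam Aligned.toTranscriptome.out.bam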

So I decided to relax the filtering parameters as indicated in this link: https://github.com/alexdobin/STAR/issues/169, because I trimmed my reads and some can be between 100 bp and 150 bp long.

"STAR --runThreadN {threads} --runMode alignReads --genomeDir {input.ref} --readFilesIn {input.fq1} {input.fq2} --readFilesCommand zcat --outSAMtype BAM Unsorted SortedByCoordinate --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --quantMode TranscriptomeSAM GeneCounts --outFileNamePrefix {output} --outStd Log {log} "

I got this final.out:

                                 Started job on |	Jul 05 14:25:19
                             Started mapping on |	Jul 05 14:25:23
                                    Finished on |	Jul 05 16:37:44
       Mapping speed, Million of reads per hour |	38.78

                          Number of input reads |	85547657
                      Average input read length |	298
                                    UNIQUE READS:
                   Uniquely mapped reads number |	70090369
                        Uniquely mapped reads % |	81.93%
                          Average mapped length |	191.51
                       Number of splices: Total |	1068826
            Number of splices: Annotated (sjdb) |	0
                       Number of splices: GT/AG |	470490
                       Number of splices: GC/AG |	43525
                       Number of splices: AT/AC |	15865
               Number of splices: Non-canonical |	538946
                      Mismatch rate per base, % |	1.29%
                         Deletion rate per base |	0.03%
                        Deletion average length |	4.76
                        Insertion rate per base |	0.03%
                       Insertion average length |	5.23
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	15205492
             % of reads mapped to multiple loci |	17.77%
        Number of reads mapped to too many loci |	247790
             % of reads mapped to too many loci |	0.29%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |	0
       % of reads unmapped: too many mismatches |	0.00%
            Number of reads unmapped: too short |	0
                 % of reads unmapped: too short |	0.00%
                Number of reads unmapped: other |	4006
                     % of reads unmapped: other |	0.00%
                                  CHIMERIC READS:
                       Number of chimeric reads |	0
                            % of chimeric reads |	0.00%

It was much better, but I am afraid the quality is really low (I am trying the parameter 0.3 as I write these lines). I filtered again with samtools view -f 2 -F 3840, and the salmon count is still very low: 24323720 counts

I used samtools flagstat to see what happens after the filter, and I get this:

48983692 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
48983692 + 0 mapped (100.00% : N/A)
48983692 + 0 paired in sequencing
24491846 + 0 read1
24491846 + 0 read2
48983692 + 0 properly paired (100.00% : N/A)
48983692 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I don't understand why I'm losing so many counts. Is it because of the filtering? But I thought I had to filter to keep only the properly paired reads...
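
To check whether the filtering itself is costing me counts, I plan to quantify the unfiltered transcriptomic BAM directly, roughly like this (file names are placeholders):

    # alignment-mode salmon on STAR's transcriptomic BAM, with no samtools filtering
    salmon quant -t transcripts.fa -l ISR -a Aligned.toTranscriptome.out.bam -p 8 -o salmon_unfiltered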

As for the sorting, that's totally my fault; I read the docs wrong. But even without sorting I get very low, unusable results, less than 26%.

The experiment is on oak: at 4 time points, 3 late-dormancy and 3 early-dormancy samples were collected, and we prepared Illumina TruSeq Stranded libraries from these samples. As a reference, I use a gene model built by my team containing the 25808 genes that oak has.

As for "Is this a polyA selection or ribosomal depletion prep": I don't know, I'll find out.

To be honest, I am totally lost, because I don't understand what's wrong with my analysis...

Thank you very much for your help once again

Kisekya

EDIT:

I discovered that I have 59 million duplicates in my data... I tried to remove them after filtering for proper pairs, and I still get poor results, about 38% mapping...
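
The duplicate removal I tried was roughly along these lines (a sketch, not my exact commands; file names are placeholders):

    # standard samtools duplicate-marking workflow
    samtools sort -n -o namesort.bam filtered.bam    # name order, required by fixmate
    samtools fixmate -m namesort.bam fixmate.bam     # add mate-score tags used by markdup
    samtools sort -o possort.bam fixmate.bam         # back to coordinate order for markdup
    samtools markdup -r possort.bam dedup.bam        # -r removes duplicates instead of only flagging them
    # note: dedup.bam is coordinate-sorted, so it would still need name-grouping
    # (e.g. samtools sort -n) before being given to salmon in alignment mode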

Kisekya avatar Jul 06 '21 07:07 Kisekya