salmon icon indicating copy to clipboard operation
salmon copied to clipboard

Removing PCR duplicates

Open b-dawes opened this issue 8 years ago • 8 comments

Hello,

My lab is in the process of testing out Salmon and potentially switching to it from traditional aligners. With a traditional aligner, we use picard's MarkDuplicates to remove PCR duplicates. Is there a way to remove PCR duplicates with Salmon? I tried using --writeMappings to generate a BAM and feed that into picard, but I just get a SAM validation error: "Not primary alignment flag should not be set for unmapped read."

Best, Brian

b-dawes avatar May 31 '17 18:05 b-dawes

Hi Brian,

In general, I would argue that one should be cautious with removing PCR duplicates in RNA-seq data (unless you are dealing with reads with UMI tags). This is because reads that align to the same reference position can easily have come from alternative transcripts sharing the same underlying sequence. Hence, the normal tests used to infer PCR duplicates with e.g. DNA-seq reads can yield false-positives in RNA-seq. This is particularly true for highly abundant transcripts (or transcripts from highly-abundant genes).

We are currently working on the code that will do duplicate removal when UMI tags are present. That methodology can be extended to remove duplicates even without UMI tags --- though I'd generally caution against that for the reasons mentioned above. However, for the time being, if you have a strong need or desire to filter PCR duplicates, you could use alignment-based Salmon with a BAM file that has duplicates removed.

Finally, regarding the error you are getting during SAM validation; this sounds like a different issue. Would you mind providing a piece of that SAM file for me to take a look at? Specifically, I don't believe the quasi-mapping output file should even contain unmapped reads (unless you consider unmapped mates of orphaned reads).

--Rob

rob-p avatar May 31 '17 21:05 rob-p

+1 on duplicate marking for RNA-seq data being counter productive. Picard can be pretty particular about the format of the SAM file, if you really want to mark the duplicates, sambamba is a lot faster and should complain less:

http://lomereiter.github.io/sambamba/docs/sambamba-markdup.html

roryk avatar May 31 '17 22:05 roryk

Thanks for the responses.

Here's the records for the first read in my bam:

NB501285:81:HVJFGBGX2:1:11101:7287:1044 105     uc008cdk.2|C4b  2845    1       36M     =       2845    0       AGATTCTCCAAATTGAGAAGGAAGGAGCCATCCACA    *       NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 149     uc008cdk.2|C4b  2845    0       *       =       2845    0       ACCCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN    *       NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 361     uc008cdl.2|C4b  2845    1       36M     =       2845    0       AGATTCTCCAAATTGAGAAGGAAGGAGCCATCCACA    *       NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 405     uc008cdl.2|C4b  2845    0       *       =       2845    0       ACCCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN    *       NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 361     uc008cdp.1|C4a  2814    1       36M     =       2814    0       AGATTCTCCAAATTGAGAAGGAAGGAGCCATCCACA    *       NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 405     uc008cdp.1|C4a  2814    0       *       =       2814    0       ACCCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN    *       NH:i:3

Picard crashes on the fourth record with error:

ERROR: Record 4, Read name NB501285:81:HVJFGBGX2:1:11101:7287:1044, Not primary alignment flag should not be set for unmapped read.

It looks like the issue is that Salmon is setting bit 4 (read unmapped) for the mates and is setting bit 256 (not primary alignment) for the secondary alignments. Both bits 4 and 256 are set for the mate of a secondary alignment which picard doesn't like.

b-dawes avatar Jun 01 '17 16:06 b-dawes

@b-dawes Did you bring it to picard developers? I do not say salmon is not broken, but ... getting more opinions woul dbe welcome. @droazen Clues?

mmokrejs avatar Jun 25 '18 20:06 mmokrejs

Hi @mmokrejs

I never did bring it up to the picard developers.

b-dawes avatar Jun 25 '18 21:06 b-dawes

Hi Brian,

In general, I would argue that one should be cautious with removing PCR duplicates in RNA-seq data (unless you are dealing with reads with UMI tags). This is because reads that align to the same reference position can easily have come from alternative transcripts sharing the same underlying sequence. Hence, the normal tests used to infer PCR duplicates with e.g. DNA-seq reads can yield false-positives in RNA-seq. This is particularly true for highly abundant transcripts (or transcripts from highly-abundant genes).

We are currently working on the code that will do duplicate removal when UMI tags are present. That methodology can be extended to remove duplicates even without UMI tags --- though I'd generally caution against that for the reasons mentioned above. However, for the time being, if you have a strong need or desire to filter PCR duplicates, you could use alignment-based Salmon with a BAM file that has duplicates removed.

Finally, regarding the error you are getting during SAM validation; this sounds like a different issue. Would you mind providing a piece of that SAM file for me to take a look at? Specifically, I don't believe the quasi-mapping output file should even contain unmapped reads (unless you consider unmapped mates of orphaned reads).

--Rob

It is in the latest Salmon release?

Thanks

antgomo avatar Dec 11 '18 12:12 antgomo

I'm curious if there is code available to remove the duplicates based on UMI tags for RNA. I'm aware of Alevin, but I don't think this suits my purpose

Regards, Steven

stevennevins avatar Jun 06 '19 13:06 stevennevins

Hi, is there any news on removing PCR duplicates in salmon? I think this is a major fallacy of this software when using paired end reads or reads with UMIs. Sometimes PCR duplication can be quite different between replicates and with paired end read it should be quite widely accepted that those are real PCR duplicates and should be removed.

Thanks!

ChenfuShi avatar Nov 21 '19 15:11 ChenfuShi