rnaseq

Improve/add UMI deduplication metrics

Open ppericard opened this issue 10 months ago • 3 comments

Description of feature

Hello ^^ I'm having difficulty finding easy-to-understand stats on UMI deduplication in the outputs. There seems to be no section for it in the MultiQC output, not even in the statistics table, where I would expect metrics such as the number of reads before dedup, the number of reads after dedup, and % duplication (from umi-tools dedup on alignments). In the output directory, I also can't find a log with easy-to-understand metrics from umi-tools dedup. I'm probably missing something. Thanks in advance. Pierre

ppericard avatar Apr 02 '24 08:04 ppericard

Since people complained about the poor performance, the generation of deduplication statistics is off by default now.

To activate that functionality, either pass --umitools_dedup_stats on the command line or set umitools_dedup_stats: true in a params file.

MatthiasZepper avatar Apr 09 '24 15:04 MatthiasZepper

Hi @MatthiasZepper, I'm sorry if I wasn't clear enough in my initial message. All my comments apply to the pipeline run with the --umitools_dedup_stats parameter activated. The *.umi_dedup.transcriptome.filtered.prepare_for_rsem.log files contain no summary of the dedup stats, and the other files are neither very informative nor easy to read: *.umi_dedup.sorted_edit_distance.tsv, *.umi_dedup.sorted_per_umi_per_position.tsv, *.umi_dedup.sorted_per_umi.tsv. There is a real need for an easy-to-read summary of deduplication, such as the one that can be obtained through MultiQC parsing of the UMI-tools logs, for example (https://github.com/MultiQC/MultiQC/pull/1769). Right now, as a user, I have even less information about deduplication than I would get from the logs by simply running the umi-tools dedup command myself.
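To illustrate the kind of summary I mean, here is a minimal sketch in Python. It assumes the umi-tools dedup log contains lines of the form "INFO Reads: Input Reads: N" and "INFO Number of reads out: N"; that format is an assumption and may differ between umi-tools versions. The function name summarize_dedup_log and the sample log are hypothetical, made up for illustration only.

```python
import re

# Assumed line formats in a `umi_tools dedup` log; adjust if your version differs.
INPUT_RE = re.compile(r"INFO Reads: Input Reads: (\d+)")
OUTPUT_RE = re.compile(r"INFO Number of reads out: (\d+)")


def summarize_dedup_log(log_text):
    """Return reads before/after dedup and % duplication, or None if not found."""
    m_in = INPUT_RE.search(log_text)
    m_out = OUTPUT_RE.search(log_text)
    if m_in is None or m_out is None:
        return None
    reads_in = int(m_in.group(1))
    reads_out = int(m_out.group(1))
    # Share of input reads that were removed as duplicates.
    pct_dup = 100.0 * (reads_in - reads_out) / reads_in if reads_in else 0.0
    return {"reads_in": reads_in, "reads_out": reads_out, "pct_duplication": pct_dup}


# Hypothetical log excerpt, not taken from a real run.
example_log = """\
2024-04-24 05:00:00,000 INFO Reads: Input Reads: 1000
2024-04-24 05:00:01,000 INFO Number of reads out: 750
"""

print(summarize_dedup_log(example_log))
```

Three numbers per sample like these (reads in, reads out, % duplication) would already be enough for the statistics table.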

ppericard avatar Apr 24 '24 05:04 ppericard

Apologies for stonewalling on this issue before. While hunting down the cause for issue #1303, it occurred to me that probably a botched MultiQC config is behind this issue as well. For some reason, we explicitly specify the MultiQC modules to be run and UMI-tools is nowhere to be found.

Since we run MultiQC with a custom config outside the pipeline again, we did not notice.

It should be fixed on this branch, but I struggle with testing at the moment.

MatthiasZepper avatar May 25 '24 18:05 MatthiasZepper

#1308 has been merged to dev and will be released as part of rnaseq 3.15. Please give it a spin to see if it solves this issue, @ppericard!

MatthiasZepper avatar Jul 12 '24 15:07 MatthiasZepper

@MatthiasZepper Thank you for dealing with this issue. I'm currently taking an extended leave from bioinformatics for the foreseeable future, so hopefully someone from the community will be able to test this. Cheers

ppericard avatar Sep 05 '24 10:09 ppericard