leafcutter
leafcutter copied to clipboard
Behavior of -p --mincluratio: unexpectedly removing intron clusters
Hi, thanks for developing leafcutter!
I'm trying to understand the impact of the option -p in the intron clustering step regarding the generation of "*_refined" from "*_pooled" (using leafcutter_cluster_regtools.py).
On a differential splicing analysis setting (cases vs. controls), I've been clustering junctions from a heterogeneous group of samples (i.e. different datasets, tissues, varying sequencing depth, etc), and comparing that with clustering on a per dataset/tissue basis. With the latter approach, I see a nice exon inclusion event in a gene of interest, matching my expectation from a differential transcript usage analysis.
However, when I do clustering on the larger and heterogeneous set of samples, I first get in the "*_pooled" file an enormous cluster with 495 junctions that includes my event of interest, but extends beyond the gene. While some junctions have only 3 reads, others have >170,000.
Then, if I use -p 0.00001, I get in the "*_refined" file a smaller cluster with 14 junctions, including my splicing event of interest, with the number of reads per junction ranging from 98 to 170,000. But the leafviz plot is noisy since I include many junctions with few supporting reads, and the delta PSI is small since it's relative to the cluster.
However, if I use -p 0.05, I lose the entire cluster in "*_refined", even though my event of interest is supported by >96,000 reads. Is that a desired behavior?
I'd love to hear from @davidaknowles, @goldenflaw et al.
Thank you! Vitor
Hi Vitor,
This may be an edge case which we will try to fix in future versions. For now, if you already know which junction/cluster you are interested in. You can simply manually define clusters (I believe) by overriding the _refined file.
Best, Yang
On Mon, Nov 21, 2022 at 3:46 PM Vitor @.***> wrote:
Hi, thanks for developing leafcutter!
I'm trying to understand the impact of the option -p in the intron clustering step regarding the generation of "_refined" from "_pooled" (using leafcutter_cluster_regtools.py).
On a differential splicing analysis setting (cases vs. controls), I've been clustering junctions from a heterogeneous group of samples (i.e. different datasets, tissues, varying sequencing depth, etc), and comparing that with clustering on a per dataset/tissue basis. With the latter approach, I see a nice exon inclusion event in a gene of interest, matching my expectation from a differential transcript usage analysis.
However, when I do clustering on the larger and heterogeneous set of samples, I first get in the "*_pooled" file an enormous cluster with 495 junctions that includes my event of interest, but extends beyond the gene. While some junctions have only 3 reads, others have >170,000.
Then, if I use -p 0.00001, I get in the "*_refined" file a smaller cluster with 14 junctions, including my splicing event of interest, with the number of reads per junction ranging from 98 to 170,000. But the leafviz plot is noisy since I include many junctions with few supporting reads, and the delta PSI is small since it's relative to the cluster.
However, if I use -p 0.05, I lose the entire cluster in "*_refined", even though my event of interest is supported by >96,000 reads. Is that a desired behavior?
I'd love to hear from @davidaknowles https://github.com/davidaknowles, @goldenflaw https://github.com/goldenflaw et al.
Thank you! Vitor
— Reply to this email directly, view it on GitHub https://github.com/davidaknowles/leafcutter/issues/221, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABGWTCPASP6BUBFLVCFH4NDWJPUT7ANCNFSM6AAAAAASHC6L6Y . You are receiving this because you were mentioned.Message ID: @.***>