drop Aberrant splicing is taking too long with STRAND=yes

Open gdeoli opened this issue 1 year ago • 3 comments

Hi,

I'm trying to run the aberrantSplicing module using DROP v1.1.1 on my own dataset (n=50). My library is strand-specific, but I want to see how the DROP outputs differ when STRAND is set to no instead of yes.

When this parameter is set to no, DROP runs in about 24 h. However, since I changed the parameter to yes, the tasks are taking much longer: I started the run on Friday, July 29th, and it is still running. The logs show no errors, only the warning: "The index file is older than the data file".

It appears the process is stuck at the rule AberrantSplicing_pipeline_Counting_01_1_countRNA_splitReads_samplewise_R.

I'm running drop as follows: snakemake aberrantExpression --cores 20

I'm running the task on a machine with a minimum of 220 GiB of memory and 32 threads. So far, memory has not been an issue.

Any thoughts?

Aug 04 '22 13:08 gdeoli

Thanks for using DROP. We recommend upgrading DROP so that you don't run into issues from legacy versions that have since been fixed. We are currently on version 1.2.2.

Secondly, the STRAND column should reflect the data you are using. According to the htseq-count documentation (the underlying counting strategy):

Important: The default for strandedness is yes. If your RNA-Seq data
has not been made with a strand-specific protocol,
this causes half of the reads to be lost. 
Hence, make sure to set the option --stranded=no unless you have strand-specific data!

So if you originally set it to no when you in fact have stranded data, that probably reduced the number of reads being counted, which is why the counting went faster. I would make sure the sample annotation matches your experiment and continue.
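The quoted htseq-count note can be illustrated with a toy model. The sketch below is not htseq-count's or DROP's implementation: `count_reads` is a hypothetical helper that assigns single-end reads to one gene using only the read strand, and the simulated libraries are made up for illustration.

```python
import random

def count_reads(read_strands, gene_strand, stranded):
    """Count reads assigned to one gene under a toy strandedness rule.

    stranded="yes":     the read must lie on the gene's strand
    stranded="reverse": the read must lie on the opposite strand
    stranded="no":      the read strand is ignored
    """
    if stranded == "no":
        return len(read_strands)
    if stranded == "yes":
        return sum(s == gene_strand for s in read_strands)
    if stranded == "reverse":
        return sum(s != gene_strand for s in read_strands)
    raise ValueError(f"unknown stranded mode: {stranded!r}")

random.seed(0)
# Simulated unstranded library: reads from a '+' gene land on either strand.
unstranded_library = [random.choice("+-") for _ in range(10_000)]
# Simulated strand-specific library: every read carries the gene's strand.
stranded_library = ["+"] * 10_000

for name, reads in [("unstranded", unstranded_library), ("stranded", stranded_library)]:
    for mode in ("yes", "no"):
        print(name, mode, count_reads(reads, "+", mode))
```

In this toy model, counting the unstranded library with `yes` keeps only about half of the reads, which is the loss the quote warns about. Counting the strand-specific library with `no` does not drop reads here, but the model ignores genes that overlap on opposite strands, where unstranded counting can make a read ambiguous.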

Do you have an idea of how many reads you are trying to count?

@vyepez88 Do you have a good idea of how long counting should take for large samples?

Aug 04 '22 15:08 nickhsmith

If the strand is set to no, the split-read counting of the aberrant splicing module tries to infer the strand of each read using the BSgenome package. In my experience, this doesn't take considerably more time. The BSgenome package connects to an online database to extract the reference genome, so it could be that this is what is taking so long. Overall, if your data is already stranded, I would simply count it that way.

Aug 08 '22 07:08 vyepez88

Hi @gdeoli, were you able to check this?

Sep 06 '22 06:09 vyepez88
