Estimated run-time for FRASER2
Hi,
I was wondering how long a typical FRASER2 run is expected to take?
When I peek at an ongoing job, it seems to hang, with no update for at least a few days (5 days in this example); the tail of the log is shown below:
Loading assay: rawCountsJ
Loading assay: psi5
Loading assay: psi3
Loading assay: rawOtherCounts_psi5
Loading assay: rawOtherCounts_psi3
Loading assay: rawCountsJnonsplit
Loading assay: jaccard
Loading assay: rawOtherCounts_jaccard
Loading assay: delta_jaccard
Loading assay: delta_psi5
Loading assay: delta_psi3
Loading assay: rawCountsSS
Loading assay: theta
Loading assay: rawOtherCounts_theta
Loading assay: delta_theta
Fri Jun 2 01:56:41 2023: jaccard
dPsi filter:FALSE: 45269 TRUE: 122243
Exclusion matrix: FALSE: 82522 TRUE: 84990
Fri Jun 2 01:57:52 2023: Injecting 266414 outliers ...
Fri Jun 2 07:01:04 2023: Run hyper optimization with 12 options.
I'm running 314 samples through FRASER2 using DROP v1.3.3 with the following config:
aberrantSplicing:
run: true
groups:
- group1
recount: false
longRead: false
keepNonStandardChrs: true
filter: true
minExpressionInOneSample: 20
quantileMinExpression: 10
minDeltaPsi: 0.05
implementation: PCA
padjCutoff: 0.1
maxTestedDimensionProportion: 6
genesToTest: null
FRASER_version: "FRASER2"
deltaPsiCutoff: 0.1
quantileForFiltering: 0.75
My compute set-up is as follows (10 cores, 60GB each, run-time of 1 week):
#BSUB -q long
#BSUB -P bio
#BSUB -W 168:00
#BSUB -J drop
#BSUB -o logs/drop_%J.stdout
#BSUB -e logs/drop_%J.stderr
#BSUB -R "span[hosts=1] rusage[mem=60000]"
#BSUB -M 60000
#BSUB -n 10
Do you have any run-time estimates for a similar number of samples? Should I trust that this is still running and wait longer?
Many thanks,
Chris
Hi @chrisodhams , sorry for the late reply. A single FRASER2 fit should be rather quick: for ~300 samples I would expect about an hour or two with 10-20 cores. However, during the hyperparameter search we run many different fits for the different latent space sizes tested, so this can take a bit longer. On our cluster this typically takes 1-2 days, depending on sample size and number of cores. So >5 days, as in your case, does indeed seem rather slow in our experience. As you submitted this issue 2 weeks ago, did it finish running in the meantime, or did it eventually fail? In case it didn't work, one suggestion would be to see if increasing the number of cores and/or the memory to ~100G (if possible) helps.
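If you end up running the fit manually in R, here is a rough sketch of how one could give FRASER more workers (a minimal sketch, assuming the standard BiocParallel backend; workingDir and dataset are placeholders, and exact signatures may differ between FRASER versions):
# Minimal sketch, assuming the standard BiocParallel backend;
# workingDir and dataset are placeholders for your DROP project paths.
library(FRASER)
library(BiocParallel)
register(MulticoreParam(workers = 20))  # make 20 workers the default backend
fds <- loadFraserDataSet(dir = workingDir, name = dataset)
# most FRASER steps also accept an explicit BPPARAM argument
fds <- optimHyperParams(fds, type = "jaccard", BPPARAM = MulticoreParam(workers = 20))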
Hi @ischeller , Thanks for this information. The run above actually timed out after a week of run time - it didn't progress past the 'Fri Jun 2 07:01:04 2023: Run hyper optimization with 12 options.' stage.
I've rerun now with 120GB memory on 5 cores (using a single node with a max of 700GB memory) and set a max run time of 2 weeks. If this fails I will run with 20 cores at 120GB each, spanning multiple nodes.
Thanks,
Hi @ischeller,
Thanks for getting back.
I've rerun FRASER2 on the 314 samples with 700GB total memory split over 10 cores (70GB per core), using the same config as above.
It is still running after 2 weeks, with the last log line, from June 22, being: 'Thu Jun 22 02:28:21 2023: Run hyper optimization with 12 options.' It has been stuck on this step for 13 days and no temporary outputs have been generated.
I've also limited the sample set to 76 and rerun with the same compute (700GB total memory split over 10 cores) and the same config. It's still stuck on the hyper optimization step.
Any ideas what is happening here?
Thanks,
Hi Chris, I'm not sure; there's no reason why it should stop during the hyperparameter optimization. I recently tried a cohort of ~200 samples and it ran fully in the usual 3-4 hours on our server. Can you try the following:
- In R, load the FRASER dataset object and check its dimensions by executing
fds <- loadFraserDataSet('{root}/processed_data/aberrant_splicing/datasets/', name = '{DROP_GROUP}')
dim(fds)
What are the values of the number of junctions and the number of splice sites?
- Check the split and non split counts by executing
counts(fds, type = 'psi3') # for split counts
counts(fds, type = 'theta') # for non split counts
Then check the total counts per sample by executing colSums on the previous matrices (see the sketch after this list). Could it be that one sample has 0 counts?
- Try with a group of 10 samples only. You can create a DROP_GROUP called e.g. small, add it to the config file, and execute: snakemake --cores X aberrantSplicing --rerun-triggers mtime
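For the first two checks together, a sketch (the dir path and group name are placeholders from your DROP project):
# Sketch of the checks above; dir and name are placeholders
# from your DROP project.
library(FRASER)
fds <- loadFraserDataSet(dir = '{root}/processed_data/aberrant_splicing/datasets/',
                         name = '{DROP_GROUP}')
dim(fds)  # number of junctions x number of samples

splitCounts <- counts(fds, type = 'psi3')      # split reads, one row per junction
nonSplitCounts <- counts(fds, type = 'theta')  # non-split reads, one row per splice site

# total counts per sample; check whether any sample has 0 counts
summary(colSums(splitCounts))
summary(colSums(nonSplitCounts))
which(colSums(splitCounts) == 0)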
Hi @vyepez88 , Thanks for getting back, so:
> fds <- loadFraserDataSet(dir=workingDir, name=dataset)
> dim(fds)
[1] 167512 314
> counts(fds, type = 'psi3')
<167512 x 314> matrix of class DelayedMatrix and type "integer":
X Y ... Z
[1,] 0 3 . 0
[2,] 0 0 . 0
[3,] 0 0 . 0
[4,] 0 2 . 0
[5,] 0 0 . 0
... . . . .
[167508,] 2 17 . 7
[167509,] 9 0 . 11
[167510,] 9 19 . 21
[167511,] 7 6 . 13
[167512,] 19 3 . 26
> counts(fds, type = 'theta')
<289357 x 314> matrix of class DelayedMatrix and type "integer":
X Y ... Z
[1,] 2 0 . 1
[2,] 0 4 . 0
[3,] 0 0 . 0
[4,] 0 0 . 0
[5,] 0 1 . 0
... . . . .
[289353,] 16 8 . 8
[289354,] 4 4 . 11
[289355,] 0 0 . 2
[289356,] 8 21 . 15
[289357,] 26 9 . 10
> summary(colSums(counts(fds, type = 'psi3')))
Min. 1st Qu. Median Mean 3rd Qu. Max.
5911485 12360005 14981966 15364413 17385947 36524542
> summary(colSums(counts(fds, type = 'theta')))
Min. 1st Qu. Median Mean 3rd Qu. Max.
3059872 5146628 6048280 6228099 6981944 13637833
> summary(rowSums(counts(fds, type = 'psi3')))
Min. 1st Qu. Median Mean 3rd Qu. Max.
20 2205 9035 28800 27776 9691762
> summary(rowSums(counts(fds, type = 'theta')))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 1065 2723 6759 6575 6062816
> length(which(rowSums(counts(fds, type = 'theta')) == 0))
[1] 151
So there are no zero counts in the columns (samples) for either split or non-split counts, but there are 151 rows (splice sites) with 0 counts for the non-split counts (theta). Would this be a problem? How best to remove these rows of 0 counts?
Just attempting now with the ten sample group - will let you know.
Many thanks.
Hi Chris, so all looks good: the number of junctions, splice sites, and reads. It is fine if there are rows with all-zero non-split reads; that means that for those splice sites, all reads are spliced. Let me know how it goes for the 10-sample group. Btw, you were able to run the demo, right?
Hi Chris, how did it go with the 10 samples?
Hi @vyepez88,
Sorry I was still waiting for confirmation of jobs to complete.
I ran a subset of 39 samples as a test and it still did not complete the hyper optimization step within ~48 hours (I can try with 10 samples but I think it will be the same story).
Using the code within the DROP pipeline to set the value of q, with a sample set of 39, the values of q are:
> unique(round(exp(seq(log(2),log(6.5),length.out = 6))))
[1] 2 3 4 5 6
I manually set q to 4 and continued with the FRASER R package directly, and all the subsequent steps (fit, calculateZscore, calculatePvalues, etc.) ran successfully. These all completed in the expected time frame.
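Roughly, the manual route looked like this (a sketch, assuming FRASER2's jaccard metric; exact arguments may differ between FRASER versions):
# Rough sketch of the manual workaround, assuming FRASER2's jaccard metric;
# check your FRASER version's documentation for exact arguments.
library(FRASER)
fds <- loadFraserDataSet(dir = workingDir, name = dataset)

# skip optimHyperParams() and fit with a fixed latent space dimension q = 4
fds <- fit(fds, q = 4, type = "jaccard", implementation = "PCA")

fds <- calculateZscore(fds, type = "jaccard")
fds <- calculatePvalues(fds, type = "jaccard")
fds <- calculatePadjValues(fds, type = "jaccard")

saveFraserDataSet(fds)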
It's very hard to debug what is going on here as there are no temporary outputs or messages, but no matter how I vary the sample size and memory/core allocation, it just does not get past the hyper optimization step.
Good that it at least worked after setting q manually. Can you try to run the demo?
Just setting that up now. I'm sure this ran successfully when we initially installed DROP but I'll run again just to confirm!
Hi @vyepez88 ,
This might take some time - I'm coming across different errors now and trying to debug.
Just to say that a collaborator who is working on the same dataset in the same environment apparently got the aberrant splicing module working. The only difference I see in the config is keepNonStandardChrs: false, whereas I have it set to true due to https://github.com/gagneurlab/drop/issues/454. I would be surprised if this were affecting anything, however.
Hi @chrisodhams, do you have any updates on this? I think you were able to run it successfully in the end?
Hi @vyepez88 - I haven't had time to check, I'm afraid - hopefully in the new year. I was blocked by https://github.com/gagneurlab/drop/issues/489 for a while, but this is apparently resolved. Will check ASAP. Cheers,