Estimated run-time for FRASER2

Open chrisodhams opened this issue 1 year ago • 13 comments

Hi,

I was wondering how long a typical FRASER2 run is expected to be?

If I peek at an ongoing job, it seems to hang, with no log updates for several days (5 days in this example). The tail of the log is shown below:

Loading assay: rawCountsJ
Loading assay: psi5
Loading assay: psi3
Loading assay: rawOtherCounts_psi5
Loading assay: rawOtherCounts_psi3
Loading assay: rawCountsJnonsplit
Loading assay: jaccard
Loading assay: rawOtherCounts_jaccard
Loading assay: delta_jaccard
Loading assay: delta_psi5
Loading assay: delta_psi3
Loading assay: rawCountsSS
Loading assay: theta
Loading assay: rawOtherCounts_theta
Loading assay: delta_theta
Fri Jun  2 01:56:41 2023: jaccard
dPsi filter:FALSE: 45269        TRUE: 122243
Exclusion matrix: FALSE: 82522  TRUE: 84990
Fri Jun  2 01:57:52 2023: Injecting 266414 outliers ...
Fri Jun  2 07:01:04 2023: Run hyper optimization with 12 options.

I'm running 314 samples through FRASER2 using DROP v1.3.3 with the following config:

aberrantSplicing:
    run: true
    groups:
        - group1
    recount: false
    longRead: false
    keepNonStandardChrs: true
    filter: true
    minExpressionInOneSample: 20
    quantileMinExpression: 10
    minDeltaPsi: 0.05
    implementation: PCA
    padjCutoff: 0.1
    maxTestedDimensionProportion: 6
    genesToTest: null
    FRASER_version: "FRASER2"
    deltaPsiCutoff : 0.1
    quantileForFiltering: 0.75

My compute set-up is as follows (10 cores, 60GB each, run-time of 1 week):

#BSUB -q long
#BSUB -P bio
#BSUB -W 168:00
#BSUB -J drop
#BSUB -o logs/drop_%J.stdout
#BSUB -e logs/drop_%J.stderr
#BSUB -R "span[hosts=1] rusage[mem=60000]"
#BSUB -M 60000
#BSUB -n 10

Do you have any estimates for a similar number of samples? Should I trust that this is still running and wait longer?

Many thanks,

Chris

chrisodhams, Jun 06 '23 11:06

Hi @chrisodhams, sorry for the late reply. A single FRASER2 fit should be rather quick: for ~300 samples I would expect about an hour or two with 10-20 cores. However, during the hyperparameter search we run many different fits for the different latent space sizes tested, so this can take a bit longer. On our cluster this typically takes 1-2 days, depending on sample size and number of cores. So >5 days, as in your case, does seem rather slow in our experience. As you submitted this issue 2 weeks ago, did it finish running in the meantime, or did it eventually fail? If it didn't work, one suggestion would be to see whether increasing the number of cores and/or the memory to ~100G (if possible) helps.

ischeller, Jun 20 '23 13:06

Hi @ischeller , Thanks for this information. The run above actually timed out after a week of run time - it didn't progress past the 'Fri Jun 2 07:01:04 2023: Run hyper optimization with 12 options.' stage.

I've now rerun with 120GB memory on 5 cores (using a single node with a maximum of 700GB memory) and set a maximum run time of 2 weeks. If this fails I will run with 20 cores at 120GB each, spanning multiple nodes.

Thanks,

chrisodhams, Jun 21 '23 10:06

Hi @ischeller,

Thanks for getting back.

I've rerun using 700GB total memory split over 10 cores (70GB per core) for FRASER2 using 314 samples (using the same config above).

It is still running after 2 weeks. The last log line, from June 22, is: 'Thu Jun 22 02:28:21 2023: Run hyper optimization with 12 options.' It has been stuck on this for 13 days and no temporary outputs have been generated.

I've also limited the sample set to 76 and rerun with the same compute (700GB total memory split over 10 cores) and the same config. It's still stuck on the hyper optimization step.

Any ideas what is happening here?

Thanks,

chrisodhams, Jul 05 '23 15:07

Hi Chris, I'm not sure; there's no reason why it would stop at the hyperparameter optimization step. I recently tried it on a cohort of ~200 samples and it ran fully in the usual 3-4 hours on our server. Can you try the following:

  • In R, load the fraser dataset object and check its dimensions by executing
fds <- loadFraserDataSet('{root}/processed_data/aberrant_splicing/datasets/', name = '{DROP_GROUP}')
dim(fds)

What are the values of: Number of junctions and Number of splice sites?

  • Check the split and non split counts by executing
counts(fds, type = 'psi3') # for split counts
counts(fds, type = 'theta') # for non split counts

Then maybe check the total counts per sample by executing colSums on the previous matrices. Could it be that 1 sample has 0 counts?

  • Try with a group of 10 samples only. You can create a DROP_GROUP called e.g. small, add it to the config file and execute snakemake --cores X aberrantSplicing --rerun-triggers mtime
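
The per-sample check suggested above can be sketched in base R. The helper name `zero_count_samples` and the toy matrix are hypothetical; with a real dataset the input would be `counts(fds, type = 'psi3')` or `counts(fds, type = 'theta')` as in the commands above, assuming the matrix carries sample names as column names.

```r
# Hypothetical helper: return the names of samples (columns) whose total count is zero.
zero_count_samples <- function(count_matrix) {
  totals <- colSums(count_matrix)
  names(totals)[totals == 0]
}

# Toy matrix standing in for counts(fds, type = 'psi3'); the values are made up.
toy_counts <- matrix(
  c(0, 3, 0,
    2, 0, 0,
    9, 5, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("sampleA", "sampleB", "sampleC"))
)

# Lists the samples with no reads at all (here only the all-zero third column).
zero_count_samples(toy_counts)
```

An empty result means every sample has at least one read, which rules out the "1 sample has 0 counts" scenario.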

vyepez88, Jul 06 '23 07:07

Hi @vyepez88 , Thanks for getting back, so:

> fds <- loadFraserDataSet(dir=workingDir, name=dataset)
> dim(fds)
[1] 167512    314

> counts(fds, type = 'psi3')
<167512 x 314> matrix of class DelayedMatrix and type "integer":
          X Y ... Z
     [1,]                 0                 3   .                 0
     [2,]                 0                 0   .                 0
     [3,]                 0                 0   .                 0
     [4,]                 0                 2   .                 0
     [5,]                 0                 0   .                 0
      ...                 .                 .   .                 .
[167508,]                 2                17   .                 7
[167509,]                 9                 0   .                11
[167510,]                 9                19   .                21
[167511,]                 7                 6   .                13
[167512,]                19                 3   .                26

> counts(fds, type = 'theta')
<289357 x 314> matrix of class DelayedMatrix and type "integer":
          X Y ... Z
     [1,]                 2                 0   .                 1
     [2,]                 0                 4   .                 0
     [3,]                 0                 0   .                 0
     [4,]                 0                 0   .                 0
     [5,]                 0                 1   .                 0
      ...                 .                 .   .                 .
[289353,]                16                 8   .                 8
[289354,]                 4                 4   .                11
[289355,]                 0                 0   .                 2
[289356,]                 8                21   .                15
[289357,]                26                 9   .                10

> summary(colSums(counts(fds, type = 'psi3')))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 5911485 12360005 14981966 15364413 17385947 36524542 

> summary(colSums(counts(fds, type = 'theta')))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 3059872  5146628  6048280  6228099  6981944 13637833 

> summary(rowSums(counts(fds, type = 'psi3')))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     20    2205    9035   28800   27776 9691762 

> summary(rowSums(counts(fds, type = 'theta')))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    1065    2723    6759    6575 6062816 

> length(which(rowSums(counts(fds, type = 'theta')) == 0))
[1] 151

So there are no zero counts in the columns (samples) for split and non-split counts, but 151 rows of the non-split counts (theta) have 0 counts. Would this be a problem? How best to remove these rows of 0 counts?

Just attempting now with the ten sample group - will let you know.

Many thanks.

chrisodhams, Jul 06 '23 09:07

Hi Chris, so everything looks good: the number of junctions, splice sites and reads. It is fine if there are rows with all 0 non-split reads in splice sites; that means that for that splice site, all reads are spliced. Let me know how it goes for the 10 sample group. Btw, you were able to run the demo, right?

vyepez88, Jul 06 '23 09:07

Hi Chris, how did it go with the 10 samples?

vyepez88, Jul 10 '23 07:07

Hi @vyepez88,

Sorry I was still waiting for confirmation of jobs to complete.

I ran a subset of 39 samples as a test and it still did not complete the hyper optimization step within ~48 hours (I can try with 10 samples but I think it will be the same story).

Using the code within the DROP pipeline to set the value of q, with a sample set of 39, the values of q are:

> unique(round(exp(seq(log(2),log(6.5),length.out = 6))))
[1] 2 3 4 5 6
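
As a small illustration, the expression above can be wrapped in a function of the sample count. This is a sketch, not DROP's actual code: the name `candidate_q` and its arguments are hypothetical, and it assumes that the upper bound 6.5 is the number of samples divided by `maxTestedDimensionProportion` (39 / 6) and that six log-spaced steps are used regardless of cohort size.

```r
# Hypothetical helper reproducing the expression above for an arbitrary sample count.
# Assumption: the upper bound equals n_samples / maxTestedDimensionProportion.
candidate_q <- function(n_samples, max_tested_dimension_proportion = 6, n_steps = 6) {
  upper <- n_samples / max_tested_dimension_proportion
  # Log-spaced values between 2 and the upper bound, rounded and de-duplicated.
  unique(round(exp(seq(log(2), log(upper), length.out = n_steps))))
}

candidate_q(39)   # same expression as above, with upper bound 39 / 6 = 6.5
candidate_q(314)  # candidates for the full cohort under the same assumptions
```

Each candidate corresponds to a separate fit during the hyperparameter search, so a short candidate list like this one should keep the search cheap for a small cohort.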

I manually set q as 4 and continued using the FRASER R package manually, and all the subsequent steps run successfully (fit, calculateZscore, calculatePvalues, etc). These all completed in the time frame expected.
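
For readers who want to try the same workaround, a minimal sketch of running the fit with a fixed q follows. It requires the Bioconductor package FRASER and an already counted and filtered FraserDataSet; `run_with_fixed_q` is a hypothetical wrapper, and the choice of `type = "jaccard"` and `implementation = "PCA"` is an assumption based on the log and config shown earlier in this thread.

```r
# Sketch only: needs the Bioconductor package FRASER and an existing FraserDataSet (fds).
library(FRASER)

# Hypothetical wrapper, not part of FRASER or DROP.
run_with_fixed_q <- function(fds, q = 4, type = "jaccard", implementation = "PCA") {
  # Fit the latent space with a fixed dimension q, skipping the hyperparameter search.
  fds <- fit(fds, q = q, type = type, implementation = implementation)
  # Nominal p-values, then multiple-testing-adjusted p-values for the fitted metric.
  fds <- calculatePvalues(fds, type = type)
  fds <- calculatePadjValues(fds, type = type)
  fds
}

# Usage, assuming fds was loaded with loadFraserDataSet() as shown earlier:
# fds <- run_with_fixed_q(fds, q = 4)
```

This bypasses the step that hangs, at the cost of not checking whether q = 4 is actually the best latent dimension for the cohort.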

It's very hard to debug what is going on here as there are no temporary outputs/messages, but however I vary the sample size and memory/core allocation, it just does not get past the hyper optimization step.

chrisodhams, Jul 10 '23 08:07

Good that at least it worked after setting the q. Can you try to run the demo?

vyepez88, Jul 10 '23 08:07

Just setting that up now. I'm sure this ran successfully when we initially installed DROP but I'll run again just to confirm!

chrisodhams, Jul 10 '23 08:07

Hi @vyepez88, this might take some time - I'm coming across different errors now and trying to debug. Just to say that a collaborator who is working on the same dataset in the same environment apparently got the aberrant splicing module working. The only difference I see in the config is keepNonStandardChrs: false, whereas I have it set to true due to https://github.com/gagneurlab/drop/issues/454. I would be surprised if this is affecting anything, however.

chrisodhams, Jul 11 '23 08:07

Hi @chrisodhams, do you have any updates on this? I think you were able to run it successfully, right?

vyepez88, Nov 15 '23 08:11

Hi @vyepez88 - I haven't had time to check, I'm afraid - will hopefully be in the new year. I was blocked by https://github.com/gagneurlab/drop/issues/489 for a while - but this is apparently resolved. Will check ASAP. Cheers,

chrisodhams, Nov 16 '23 12:11