
Slow Databricks GLOW implementation

Open · bbsun92 opened this issue 4 years ago · 5 comments

I am trying to get Glow WGR to run by following the example step by step, but with my own data (~300k variants in ~350k individuals, ~450 binary phenotypes).

After writing the Delta files and reading from them, the `reducer.fit_transform` call takes around 3 days:

```python
myreducer = RidgeReducer()
model_df = myreducer.fit_transform(block_df, label_df, sample_blocks, covariate_df)

model_df.cache()
```

The estimator step:

```python
estimator = LogisticRegression()
mymodel_df, cv_df = estimator.fit(model_df, label_df, sample_blocks, covariate_df)
```

took more than 8 days and failed with:

```
FatalError: SparkException: Job aborted due to stage failure: Task 89 in stage 23.0 failed 4 times, most recent failure: Lost task 89.3 in stage 23.0 (TID 116943, 10.248.234.190, executor 87): ExecutorLostFailure (executor 87 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues.
```

I was using the 7.4 Genomics Runtime on large clusters with an i3.8xlarge driver and 4-22 workers.

This sounds way too long given the speeds reported in the bioRxiv paper?

Any help on this would be great. Thanks!

bbsun92 · Jan 13 '21

@bbsun92 It looks like you're running into memory pressure because of the large number of phenotypes. In our tests, we typically use batches of 15-20 phenotypes for each run. Could you try that?

Btw, what format is the input data? If you have not done so already, it's best to write the block matrix to a Delta table.

henrydavidge · Jan 13 '21
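For illustration, batching the phenotypes as suggested above might look roughly like the sketch below. This assumes the 2021-era `glow.wgr.linear_model` import path and that `label_df` is a pandas DataFrame with one column per phenotype, as in the Glow WGR examples; `batch_size` and the other loop variables are illustrative names, not part of the Glow API.

```python
from glow.wgr.linear_model import RidgeReducer

# Split the ~450 phenotype columns into batches of 15-20 and run the reducer
# on each batch separately to limit memory pressure.
batch_size = 16  # illustrative choice within the suggested 15-20 range
phenotype_cols = list(label_df.columns)
reduced_batches = []

for start in range(0, len(phenotype_cols), batch_size):
    label_batch = label_df[phenotype_cols[start:start + batch_size]]
    reducer = RidgeReducer()
    # Same call as in the issue, but restricted to one batch of phenotypes.
    reduced_batches.append(
        reducer.fit_transform(block_df, label_batch, sample_blocks, covariate_df))
```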

> @bbsun92 It looks like you're running into memory pressure because of the large number of phenotypes. In our tests, we typically use batches of 15-20 phenotypes for each run. Could you try that?
>
> Btw, what format is the input data? If you have not done so already, it's best to write the block matrix to a Delta table.

Thanks, I think I tried the same with a smaller number of phenotypes and it was still very slow. The input data is the genetic data in Delta table format, as shown in the Glow example code. I don't think the intermediate block matrix is written out as a separate Delta table. Also, do you have a benchmark for how long stage 1 should take? Either way, I gather it shouldn't be taking 10 days given the large number of nodes being used?

bbsun92 · Jan 13 '21
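For reference, writing the intermediate block matrix out to a Delta table and reading it back before stage 2, as suggested above, might look roughly like this. The path and variable names here are placeholders, not from the thread; only the standard Spark Delta read/write API is assumed.

```python
# Placeholder path; substitute your own DBFS/Delta location.
block_path = "dbfs:/tmp/wgr/reduced_block_matrix.delta"

# Persist the reduced block matrix so stage 2 reads from storage instead of
# recomputing the full stage 1 lineage.
model_df.write.format("delta").mode("overwrite").save(block_path)

# Read it back before fitting the estimator.
model_df = spark.read.format("delta").load(block_path)
```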

Hi @bbsun92! I performed similar scale testing following PR #282 (this is included in Genomics Runtime 7.4). In my testing setup, I used the following:

  • Cluster with a r5d.12xlarge driver and 6 r5d.12xlarge workers
  • Dataset with 500K samples, 500K variants, 25 phenotypes, 16 covariates, and 5 alphas

The runtime was 15.85 minutes for stage 1 (writing the blocked data to a Delta table) and 85.8 minutes for stage 2.

karenfeng · Jan 29 '21
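For context, stage 1 here corresponds roughly to blocking the genotype matrix and writing it out to Delta. A condensed sketch, based on the Glow WGR documentation of that era, is below; the paths and block-size parameters are illustrative, and `variant_df` is assumed to already carry the numeric `values` column the WGR functions expect.

```python
from glow.wgr.functions import get_sample_ids, block_variants_and_samples

# Genotype data stored as a Delta table of variant rows (placeholder path).
variant_df = spark.read.format("delta").load("dbfs:/tmp/wgr/genotypes.delta")

sample_ids = get_sample_ids(variant_df)

# Block the genotype matrix; the block sizes below are illustrative.
block_df, sample_blocks = block_variants_and_samples(
    variant_df, sample_ids, variants_per_block=1000, sample_block_count=10)

# Stage 1: write the blocked matrix out to a Delta table (placeholder path).
block_df.write.format("delta").mode("overwrite").save("dbfs:/tmp/wgr/block_matrix.delta")
```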

Thanks Karen. Do you have notebooks or a list of the exact steps that were taken, so I can try to reproduce? Especially for stage 1. Thanks!

bbsun92 · Feb 19 '21

Hey @bbsun92, this repo contains all the latest notebooks under `glow/docs/source/_static/notebooks`.

williambrandler · Feb 22 '21