
Improve speed of cvxpy for BaDataCvxCleaner

Open grgmiller opened this issue 2 years ago • 0 comments

On my machine, when running the physics-based data cleaning for the EIA-930 data, the step "Running BaDataCvxCleaner for 13223 rows" takes just over an hour and 40 minutes to solve (about 6,022 seconds, with 4 cores running at 100% at 3.29 GHz). @gailin-p, it sounds like this step generally takes less time when you run it, so I'm wondering what we can do to improve the performance of the solver on any machine.

Here is the full output of the physics reconciliation step:

Running physics-based data cleaning
2022-06-25 20:13:03,830 - scraper - INFO - Basic data cleaning
2022-06-25 20:13:05,655 - load - WARNING - Inconsistent columns: set(NG_cols) != set(ID_cols2)
2022-06-25 20:13:05,655 - clean - INFO - Running BaDataBasicCleaner
2022-06-25 20:13:05,657 - clean - INFO - Adding demand columns for 0 bas
2022-06-25 20:13:05,658 - clean - INFO - Adding demand, generation and TI columns for 8 foreign bas
2022-06-25 20:13:07,605 - clean - INFO - Reinitializing fields
2022-06-25 20:13:07,622 - clean - INFO - Basic cleaning took 1.97 seconds
2022-06-25 20:13:16,473 - scraper - INFO - Rolling window data cleaning
2022-06-25 20:13:18,409 - clean - INFO - Running BaDataRollingCleaner (2 runs)
2022-06-25 20:14:09,513 - clean - INFO - Rolling window cleaning took 51.10 seconds
2022-06-25 20:14:37,514 - scraper - INFO - Optimization-based cleaning with src data: post 2018-07-01 00:00:00+00:00
2022-06-25 20:14:37,619 - clean - INFO - Running BaDataCvxCleaner for 13223 rows
2022-06-25 21:54:59,141 - clean - INFO - Checking BAs...
2022-06-25 21:54:59,609 - clean - INFO - Execution took 6021.99 seconds
2022-06-25 21:55:49,354 - scraper - INFO - gridemissions.workflows.make_dataset took 6165.52 seconds
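
To make it clear where the time goes, here is a small sketch that computes the gap between consecutive log lines. It is not part of gridemissions: the `parse_timestamp` helper and the selection of lines are my own, and it assumes every line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp as in the output above.

```python
from datetime import datetime

# A subset of the log lines above, copied verbatim.
log_lines = [
    "2022-06-25 20:13:03,830 - scraper - INFO - Basic data cleaning",
    "2022-06-25 20:13:16,473 - scraper - INFO - Rolling window data cleaning",
    "2022-06-25 20:14:37,619 - clean - INFO - Running BaDataCvxCleaner for 13223 rows",
    "2022-06-25 21:54:59,141 - clean - INFO - Checking BAs...",
]


def parse_timestamp(line):
    """Parse the leading 23-character timestamp of a log line (hypothetical helper)."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


# Seconds elapsed between each line and the next one, keyed by the log message.
durations = {}
for current, following in zip(log_lines, log_lines[1:]):
    message = current.split(" - ", 3)[-1]
    elapsed = parse_timestamp(following) - parse_timestamp(current)
    durations[message] = elapsed.total_seconds()

for message, seconds in durations.items():
    print(f"{seconds:10.2f} s  {message}")
```

Nearly all of the run falls between the "Running BaDataCvxCleaner" line and the "Checking BAs..." line, so the solver call itself looks like the bottleneck rather than the basic or rolling-window cleaning.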

Some questions to follow up on:

  • Is this a simple linear program, or are we solving a MILP?
  • What is the default solver being used, and is there a faster open-source option available?
  • Are there solver options that we could specify to speed this up? e.g. if this is a MILP, can we specify a larger mipgap?
  • Is this step doing the cleaning on all data back to 2018-07-01, or is it only cleaning the data for the year specified?

grgmiller, Jun 26 '22 04:06