Estimate redacted EIA 923 fuel prices

Open TrentonBush opened this issue 2 years ago • 0 comments

Motivation

This project began as an effort to remove our external dependency on the EIA API (see epic #1491 and its issue #1343), but expanded into this epic once we realized we could improve substantially on EIA's methodology.

Why impute? About a third of the fuel cost data are missing. Advocates can use, and have used, these data to identify plants with high fuel costs (particularly coal plants) to target in early retirement campaigns. The more complete and accurate the data are, the more opportunities for such action we can support.

Issues:

Systematic Biases

  • The data are systematically biased: 47% of plants have complete data, 47% have no data, and only 6% have partial data.
  • In general, independent power producers (IPPs, or merchant generators) redact all their fuel prices, and these generators are concentrated in competitive wholesale markets, especially in the Northeastern US, where there are essentially no reported fuel prices.
  • In addition, the Northeast has a unique seasonality in its natural gas prices, which would be impossible to infer by sampling data elsewhere in the country.
  • This means we have to use the aggregate data from the EIA API to accurately estimate prices nationwide.
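The complete / none / partial split described above can be checked with a short script. The following is a minimal sketch rather than the project's actual code: it assumes each delivery is a dict with a `plant_id_eia` key and a `fuel_cost_per_mmbtu` key that is `None` when the price is redacted. Those field names, the helper functions, and the toy records are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def price_coverage_by_plant(records):
    """Label each plant 'complete', 'none', or 'partial' according to how
    many of its deliveries carry a reported fuel price."""
    total = defaultdict(int)
    reported = defaultdict(int)
    for rec in records:
        plant = rec["plant_id_eia"]  # assumed field name
        total[plant] += 1
        if rec["fuel_cost_per_mmbtu"] is not None:  # assumed field name
            reported[plant] += 1

    labels = {}
    for plant, n_deliveries in total.items():
        if reported[plant] == n_deliveries:
            labels[plant] = "complete"
        elif reported[plant] == 0:
            labels[plant] = "none"
        else:
            labels[plant] = "partial"
    return labels


def coverage_shares(labels):
    """Return the fraction of plants that fall in each coverage class."""
    counts = Counter(labels.values())
    n_plants = len(labels)
    return {k: counts[k] / n_plants for k in ("complete", "none", "partial")}


# Toy deliveries (made up for illustration, not real EIA 923 data).
toy_deliveries = [
    {"plant_id_eia": 1, "fuel_cost_per_mmbtu": 2.1},
    {"plant_id_eia": 1, "fuel_cost_per_mmbtu": 2.3},
    {"plant_id_eia": 2, "fuel_cost_per_mmbtu": None},
    {"plant_id_eia": 2, "fuel_cost_per_mmbtu": None},
    {"plant_id_eia": 3, "fuel_cost_per_mmbtu": 3.0},
    {"plant_id_eia": 3, "fuel_cost_per_mmbtu": None},
    {"plant_id_eia": 4, "fuel_cost_per_mmbtu": 1.8},
]
toy_labels = price_coverage_by_plant(toy_deliveries)
toy_shares = coverage_shares(toy_labels)
```

A split dominated by the "complete" and "none" classes, as reported above, suggests that redaction is a plant-level decision rather than random missingness, which is why sampling from reporting plants alone would not represent the redacting ones.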

Scope

A major organizational concern is not whether we should do this project but how much effort to devote to it. Improving model accuracy is a potentially endless spiral of diminishing returns. Is there an accuracy threshold we can call good enough? Is there a minimum rate of accuracy improvement per unit of time spent, below which we stop? We need to define this before getting sucked into the endless labyrinth of interesting technical problems.

Requirements

  • Produce a fuel price estimate for every delivery in the fuel_receipts_costs_eia923 table.
  • Do not rely on the EIA API in the pipeline, due to reliability issues in CI / testing and user setup difficulties.
  • Estimates should be at least as accurate as the coarse aggregations that we've used historically.
  • Estimates should be consistent with the spatial and temporal variation we see in the aggregated data (e.g. the seasonal variability that's observed in the aggregated natural gas prices from the Northeastern US).
  • The model should be performant enough to run as part of our nightly builds (roughly under 10 minutes of run time).
  • We should avoid new software dependencies if possible.
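To make the "coarse aggregation" baseline in the requirements concrete, here is one way such a fill could look. This is a hedged sketch, not the historical PUDL implementation: it swaps in a simple group median, and the record fields (`state`, `fuel_type`, `month`, `price`), the function name, and the toy prices are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def fill_with_group_medians(records, keys=("state", "fuel_type", "month")):
    """Fill missing prices with the median reported price of the same group.

    Records whose group has no reported price at all are left unfilled.
    """
    reported = defaultdict(list)
    for rec in records:
        if rec["price"] is not None:
            reported[tuple(rec[k] for k in keys)].append(rec["price"])
    group_medians = {group: median(prices) for group, prices in reported.items()}

    filled = []
    for rec in records:
        out = dict(rec)  # copy so the input records are not mutated
        out["price_is_estimated"] = False
        if out["price"] is None:
            estimate = group_medians.get(tuple(rec[k] for k in keys))
            if estimate is not None:
                out["price"] = estimate
                out["price_is_estimated"] = True
        filled.append(out)
    return filled


# Toy deliveries (made-up prices, for illustration only).
toy_records = [
    {"state": "CT", "fuel_type": "gas", "month": "2020-01", "price": 4.0},
    {"state": "CT", "fuel_type": "gas", "month": "2020-01", "price": 6.0},
    {"state": "CT", "fuel_type": "gas", "month": "2020-01", "price": None},
    {"state": "NY", "fuel_type": "gas", "month": "2020-01", "price": None},
]
toy_filled = fill_with_group_medians(toy_records)
```

In the toy data the third delivery is filled from its group, while the fourth stays empty because its group has no reported prices. That is the same situation described for the Northeast above: a baseline built only from reported deliveries cannot cover regions where nearly everything is redacted.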

Issues for July 18-31 sprint

  • [x] #1709
  • [x] #1710
  • [x] #1711
  • [x] #1712
  • [x] #1714
  • [x] #1748
  • [x] #1720
  • [x] #1762
  • [x] #1763
  • [ ] #1764
  • [ ] #1765

Issues for Aug 1-14 sprint

  • [ ] #1766
  • [ ] #1767

TrentonBush, Jun 22 '22 22:06