Train HistGBDT model to predict fuel price residuals

Open zaneselvans opened this issue 1 year ago • 4 comments

Rather than trying to predict fuel prices directly, which have regional variability we can't capture in the training data, create a model that predicts the expected difference between an average (aggregate) fuel price and the delivered fuel price, given the fuel type, time, location, plant attributes, fuel supplier, etc. Use the aggregate fuel prices from the EIA API that were integrated into the PUDL DB in #1765.

These predicted residual values can then be used to modulate the aggregated fuel prices that are available from the EIA API data, even in regions where there are no individual fuel price records available.

Generally speaking, the HistGBDT models developed under #1708 / #1714 should be a helpful starting place.

Questions

  • Should it try to predict an absolute difference (in $/mmbtu) or a fractional/percentage difference?
  • Should the same model be used to predict the residuals relative to all the aggregated prices? Or should there be a different model trained for each different aggregation? The value of the loss function will be different for the different aggregations so it seems like multiple models will probably be required, given that we won't have all aggregates available for comparison with each fuel delivery record.
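To make the first question concrete, here is a minimal sketch of the two candidate targets and how each would modulate an aggregate price. The function names and all the numbers are hypothetical and only for illustration; nothing here comes from PUDL or the EIA API.

```python
def absolute_residual(delivered: float, aggregate: float) -> float:
    """Delivered price minus aggregate price, in $/MMBtu."""
    return delivered - aggregate


def fractional_residual(delivered: float, aggregate: float) -> float:
    """Delivered price minus aggregate price, as a fraction of the aggregate."""
    return (delivered - aggregate) / aggregate


def apply_absolute(aggregate: float, predicted_residual: float) -> float:
    """Modulate an aggregate price with a predicted absolute residual."""
    return aggregate + predicted_residual


def apply_fractional(aggregate: float, predicted_residual: float) -> float:
    """Modulate an aggregate price with a predicted fractional residual."""
    return aggregate * (1.0 + predicted_residual)


# Hypothetical delivery record and its matching aggregate price ($/MMBtu).
delivered, aggregate = 3.30, 3.00
abs_res = absolute_residual(delivered, aggregate)
frac_res = fractional_residual(delivered, aggregate)

# Apply the same residuals to a hypothetical, more expensive aggregate.
other_aggregate = 6.00
est_abs = apply_absolute(other_aggregate, abs_res)
est_frac = apply_fractional(other_aggregate, frac_res)
```

With these made-up numbers the two targets agree on the record they were computed from but diverge once applied to a different aggregate (roughly 6.30 vs. 6.60 $/MMBtu), which is the practical difference the question is asking about.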

zaneselvans avatar Jul 18 '22 23:07 zaneselvans

The EIA 923 documentation notes a few discontinuities to be aware of. Based on the notes below, we may want to provide estimates only for 2013-present.

  • data prior to 2002 was collected via FERC Form 423, which only covered utility owned plants and only for steam turbines and combined cycle units. "a significant number of plants either did not submit fossil fuel receipts data or submitted only a portion of their fossil fuel receipts."
  • 2002 through 2007: collection moved to EIA‐423 and now included both utility and non-utility plants. Data was collected monthly.
  • 2008 through 2012: the Form EIA‐423 was superseded by Schedule 2 of the Form EIA‐923
    • "Beginning with 2008 data, only a sample of the respondents report monthly, with the remainder reporting [annual totals]... monthly fuel receipts values for the annual surveys were imputed via regression." (changed in 2013)
    • "If the reported data appear to be in error and the data issue cannot be resolved by follow up contact ... a regression methodology is used to impute for the facility." (removed in 2013 for FRC only)
    • because FERC's exemptions were removed, "receipts data from 2008 and later cannot be directly compared to previous years’ data for the regulated sector. Furthermore, there may be a notable increase in fuel receipts beginning with January 2008 data."
  • 2013-present:
    • only for fuel receipts data, they stopped using regression imputation for erroneous values
    • regression imputation of monthly values was replaced by end-of-year reporting of monthly-resolution data
    • for plants primarily fueled by natural gas, petroleum coke, distillate fuel oil, and residual fuel oil, the reporting threshold was changed from 50 megawatts to 200 megawatts (goodbye peakers?). The threshold for coal plants remained at 50 megawatts.
    • The requirement to report self-produced and minor fuels (i.e., blast furnace gas, other manufactured gases, kerosene, jet fuel, propane, and waste oils) was eliminated.
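
One way to carry these discontinuities along with the data is to tag each record with its reporting regime. This is only a sketch: the function name and the regime labels are hypothetical, and only the year boundaries come from the notes above.

```python
def frc_reporting_regime(year: int) -> str:
    """Map a report year to a (hypothetical) label for its reporting regime."""
    if year < 2002:
        # FERC Form 423: utility-owned plants only, incomplete submissions.
        return "ferc_423"
    if year <= 2007:
        # EIA-423: utility and non-utility plants, monthly collection.
        return "eia_423"
    if year <= 2012:
        # EIA-923 Schedule 2 with regression imputation of some values.
        return "eia_923_imputed"
    # EIA-923 Schedule 2 after the 2013 changes to imputation and thresholds.
    return "eia_923_post_2013"


regimes = {year: frc_reporting_regime(year) for year in (2001, 2002, 2007, 2008, 2012, 2013)}
```

A column like this could be used to filter estimates to 2013-present, or to attach the caveats for each period to whatever estimates are published.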

There are a handful of other changes that weren't relevant to FRC but are good context for other analyses, like changes to how EIA allocates fuel between thermal and electrical uses for CHP plants.

TrentonBush avatar Jul 19 '22 07:07 TrentonBush

Well, we should definitely add these reporting change notes to the fuel_receipts_costs_eia923 table metadata and/or our EIA-923 data source page.

From chatting with @cmgosnell, my understanding is that the RMI Hub team will want complete fuel price estimates going back as far as possible (they use 2000 as the emissions baseline they're comparing against). If we don't make estimates for the missing prices, someone else will (and then maybe build on top of those estimates, and eventually need them integrated into other stuff...). So I suspect we should do our best to fill in the missing values with reasonable estimates, and document the caveats based on the different reporting regimes over time. We haven't done it yet, but I think we will eventually want to pull in the EIA-423 fuel receipts data.

Maybe @arengel or @jrea-rmi have thoughts to add here from their use cases.

zaneselvans avatar Jul 19 '22 15:07 zaneselvans

Currently the Hub uses FERC 1 fuel cost data because we only cover FERC 1 respondents, and it means we don't have to deal with FERC-EIA matching to assemble these datasets; it's all just FERC 1 cost data.

However, another piece of work we've been pursuing recently does require more comprehensive fuel cost data and for that we created a process for estimating missing data.

Coal fuel cost source priority

  1. Use reported costs if available
  2. Estimate coal costs using a regression on the following parameters
    1. Distance to coal mine
    2. Dummy for contract term (0 if <= 36 months, 1 if > 36 months)
    3. Dummy for type of contract (1 if spot purchase, 0 if contract)
    4. Dummy for each mine (by mine_msha_id)
  3. Estimate fuel costs based on estimates from nearby plants with an expanding radius

Oil/NG fuel cost source priority

  1. Use reported costs if available
  2. Estimate fuel costs based on estimates from nearby plants with an expanding radius
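
The expanding-radius fallback in the last step of both lists could look roughly like the sketch below. The radii, the use of a simple mean, and the function names are all assumptions made for illustration; the actual process described above may differ on every one of them.

```python
import math
from statistics import mean


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def expanding_radius_estimate(target, neighbors, radii_km=(50, 100, 200, 400)):
    """Average the prices of plants within the smallest radius that has any.

    target: (lat, lon) of the plant missing a fuel cost.
    neighbors: iterable of (lat, lon, price) for plants with a known cost.
    Returns (estimate, radius_km), or (None, None) if no plant is in range.
    """
    for radius in radii_km:
        prices = [
            price
            for lat, lon, price in neighbors
            if haversine_km(target[0], target[1], lat, lon) <= radius
        ]
        if prices:
            return mean(prices), radius
    return None, None


# Hypothetical plants: two neighbors about 56 km north and south of the target.
known = [(40.0, -75.0, 3.0), (41.0, -75.0, 4.0)]
estimate, radius_used = expanding_radius_estimate((40.5, -75.0), known)
```

Returning the radius alongside the estimate makes it possible to flag values that were filled from far away.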

It has some ideas in common (I think) with what is proposed here, but not being familiar with HistGBDTs (or actually doing this analysis), I am probably not the best person to opine on that. I've reached out to other members of the team who have been thinking about this more for their thoughts on the proposal here, our use cases, etc.

arengel avatar Jul 19 '22 17:07 arengel

@arengel if you want to see the messing around we've done with the gradient boosted decision tree regression, you can check out PR #1696 and its associated branch, and some of the other closed issues under #1708. The relevant code is under pudl.analysis.estimate_fuel_prices.py (or something like that).

I enjoyed the StatQuest videos explaining how the method works in general.

It makes it ridiculously easy to add many different categorical or continuous variables and extract predictive information from them, in a much more dynamic and generalized way than manually selecting thresholds allows.

It did a great job of predicting fuel prices where there was lots of analogous data, even when it wasn't allowed to train on any data from the plants or states it was going to have to predict. But we discovered that most of the missing fuel prices were very clustered: entire regions (esp. the NE US) were totally lacking in data and had unique market characteristics that couldn't be learned from other regions. That led to the suggestion that we try to learn how to predict deviations from the aggregate fuel prices, and use those predictions to modulate the published aggregates in regions with lots of redacted prices.

zaneselvans avatar Jul 19 '22 18:07 zaneselvans