
Estimate monthly plant-level fuel prices w/o using the EIA API

zaneselvans opened this issue 2 years ago • 48 comments

Instead of using the EIA API to pull monthly average fuel costs by state and fuel when individual fuel deliveries have their costs redacted in the fuel_receipts_costs_eia923 table, calculate those averages ourselves.

Motivation

This change will address several issues:

  • The EIA API is missing a fair amount of the data anyway. Sometimes whole state-months are missing. It also only contains data for coarse fuel categories (coal, petroleum, natural gas) rather than the specific fuel types.
  • Relying on the API means asking users to register for an API key and manage environment variables. This is a barrier for many of our less technical users.
  • Whenever something goes wrong with the API, our CI tests fail, and we can't work with this data locally. Over time this has been happening more frequently. HTML gets returned instead of JSON, or the network is down.
  • EIA is discontinuing the v1 API in November 2022, so our current setup will stop working anyway.
  • There's a lot of information in the fuel_receipts_costs_eia923 table and in related tables about the plants, mines, and suppliers involved. It should be possible to make a fairly good estimate of the fuel prices from scratch given all that context.

Approach

  • Estimate fuel prices using a variety of aggregations and use them to fill missing values.
  • Start with the most granular / accurate and progressively apply less specific estimates until everything is filled in.
  • Tag each record indicating which estimation was used to fill it in.
  • Pre-calculate all of the aggregations so that we can look at how they compare with actual values first.
  • Add each of these aggregations to the original FRC dataframe for plotting.
  • We should also include the EIA API values for comparison / constraint based on the redacted values.
  • Looking at the EIA API, only PEL, PC, COW, and NG really have values for $/MMBTU at census region and state level.
  • Seems like the very granular fuel types only have prices for US Total, at least at the monthly level.
  • Use median values of the fuel prices in $/MMBTU
  • Maybe calculate a weighted median? Want typical MMBTU, not typical delivery.
  • @gschivley and Neha suggested using both spatial and temporal interpolation -- averaging prices from the adjacent states, and filling in gaps in the monthly time series when possible.
  • We could also use a low-effort, but powerful estimator like XGBoost or a random forest to try and incorporate much more information, without designing something bespoke from scratch.
  • We should be able to benchmark these calculations against the data from the API or the specific information reported in the FRC table by doing some random knockouts to see how well we can recreate the reported values.
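A minimal sketch of that knockout check; fill_func stands in for whatever filling routine is being benchmarked, and the filled column name matches the filling code later in this thread:

import numpy as np

def knockout_benchmark(frc, fill_func, frac=0.1, seed=42):
    """Mask a random sample of known prices, refill them, and return the relative error."""
    rng = np.random.default_rng(seed)
    known = frc.index[frc["fuel_cost_per_mmbtu"].notna()]
    knocked_out = rng.choice(known, size=int(frac * len(known)), replace=False)
    test = frc.copy()
    test.loc[knocked_out, "fuel_cost_per_mmbtu"] = np.nan
    filled = fill_func(test)  # e.g. the aggregation cascade described above
    truth = frc.loc[knocked_out, "fuel_cost_per_mmbtu"]
    return (filled.loc[knocked_out, "filled"] - truth) / truth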

Choosing Aggregations

  • How do we decide how to prioritize aggregations?
  • Coal prices don't vary much month to month, so aggregating annually would have little impact.
  • Gas & Petroleum prices can vary dramatically month to month, so aggregating across time is bad.
  • Petroleum fuel prices are highly correlated nationwide, so aggregating geographically has little impact.

Intuitive Aggregation Priorities

  1. Most precise: ["state", "energy_source_code", "report_date"]
  2. Annual aggregation (coal): ["state", "energy_source_code", "report_year"]
  3. Regional aggregation (petroleum): ["census_region", "energy_source_code", "report_date"]
  4. Fuel type aggregation: ["state", "fuel_group_code", "report_date"]
  5. Both regional and fuel type aggregation: ["census_region", "fuel_group_code", "report_date"]
  6. Annual, regional, and fuel type aggregations: ["census_region", "fuel_group_code", "report_year"]

Questions:

  • Should we use an MMBTU-weighted median rather than a delivery-weighted median? (one possible implementation is sketched below)
  • How should we identify outlier values in the fuel prices which should be replaced? Some are totally whacked.
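A minimal sketch of an MMBTU-weighted median; the weight column name matches the FRC table, but this isn't the eventual PUDL implementation:

import numpy as np
import pandas as pd

def weighted_median(values, weights):
    """Return the value at which half the total weight lies on either side."""
    df = pd.DataFrame({"value": values, "weight": weights}).dropna()
    if df.empty:
        return np.nan
    df = df.sort_values("value")
    halfway = df["weight"].sum() / 2.0
    return df.loc[df["weight"].cumsum() >= halfway, "value"].iloc[0]

# Price of the typical MMBTU (not the typical delivery) within each group:
wmed = frc.groupby(["state", "energy_source_code", "report_date"]).apply(
    lambda g: weighted_median(g["fuel_cost_per_mmbtu"], g["fuel_consumed_mmbtu"])
)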

Other Potential Refinements

  • Automatically fill using aggregations in order of increasing dispersion of the error distribution (e.g. IQR) rather than hard-coding the order based on intuition and eyeballing it.
  • Calculate the dispersion of the error distribution on an annual basis, rather than across the entire timeline, in case the temporal, fuel type & spatial correlations change over time.

Remaining tasks:

  • [x] Always include plant_state in the fuel_receipts_costs_eia923 output table.
  • [x] Add the census regions to state mappings into the metadata enums / constants.
  • [x] Replace the existing roll & fill method in the fuel_receipts_costs_eia923 output routine.
  • [x] Update tests to work with the new version of frc_eia923
  • [x] Remove API_KEY_EIA infrastructure from everywhere in the code, so we aren't unknowingly relying on it.
  • [x] Make filling in missing fuel prices the default behavior
  • [x] Fix the filled_by labeling, which currently shows all filled values as national_fgc_year, the last aggregation applied.
  • [x] Remove fuel_group_code from the fuel_receipts_costs_eia923 table and add it to the energy_sources_eia coding table, and add it back into the output function.
  • [x] Understand why these changes are apparently affecting output row counts
  • [x] Pull the fuel price filling out into its own separate function
  • [x] Understand why merge_date() is removing ~10k frc_eia923 records.
  • [x] Implement weighted median function to use in filling & identifying outliers
  • [x] Add weighted median unit tests
  • [x] Identify outlying fuel prices using modified z-score with MMBTU weighted median
  • [ ] Have @cmgosnell look for weirdness in the results of a new MCOE calculation in an RMI context.
  • [ ] Update release notes
  • [ ] After merging into main remove API_KEY_EIA from the GitHub secrets.

zaneselvans avatar Nov 10 '21 22:11 zaneselvans

Hey @katherinelamb I'd like to talk to you about how we might use some ML / regressions to fill in these missing fuel costs if that's an issue you might like to take on.

zaneselvans avatar Mar 16 '22 19:03 zaneselvans

This definitely sounds interesting and doable.

Potential data for features

  • who is selling the fuel, who is buying it, what kind of fuel, how much, what kind of contract, prices in neighboring states, historical timeseries

It's probably best to start by trying to reduce the problem to just be either spatial or temporal and then try some regressions and common modeling frameworks. Seems like it's harder but still doable to model both spatially and temporally.
A novel framework for spatio-temporal prediction

katie-lamb avatar Mar 17 '22 16:03 katie-lamb

While we do have the lat/lon location for all the plants, for the "spatial" dimension I was imagining something like looking at the prices reported in adjacent states if they're available, since I imagine that there are generally regional price trends. I think this is what Aranya / Greg did. E.g. if a given state-month-fuel combination is missing a price, fill it in with the average of the reported price for adjacent states for that fuel and month, and then that filled-in state-month-fuel table could be used to estimate the cost per unit of fuel for any records in the fuel_receipts_costs_eia923 table that have redacted values. But maybe that's more vague and complex than it needs to be?

I guess in an ideal world we'd be able to make predictions / estimations for every record with a missing value, given all of the information available in a very generic way, but that seems like asking a lot. There would be a lot of categorical variables (state, buyer, seller, fuel type, spot market vs. contract purchase) and numerical values (quantity of fuel purchased, heat content per unit of fuel, cost of fuel for that month in all adjacent states, or the region, or the country as a whole). And really you could bring in any number of utility, plant, and generator attributes if they were helpful.
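For concreteness, a minimal sketch of the adjacent-state fill described above; the STATE_NEIGHBORS adjacency map is hypothetical (only two entries shown), and the input is assumed to be a state-month-fuel price table:

import pandas as pd

# Hypothetical adjacency map -- a real one needs an entry per state.
STATE_NEIGHBORS = {
    "CO": ["WY", "NE", "KS", "OK", "NM", "AZ", "UT"],
    "ID": ["WA", "OR", "NV", "UT", "WY", "MT"],
}

def fill_from_neighbors(state_prices):
    """Fill missing (state, month, fuel) prices with the mean of adjacent states.

    Expects one row per (report_date, energy_source_code, state) with a
    fuel_cost_per_mmbtu column, e.g. the output of a groupby().mean().
    """
    wide = state_prices.pivot_table(
        index=["report_date", "energy_source_code"],
        columns="state",
        values="fuel_cost_per_mmbtu",
    )
    for state, neighbors in STATE_NEIGHBORS.items():
        if state not in wide.columns:
            continue
        present = [n for n in neighbors if n in wide.columns]
        # Mean of whatever neighbors reported a price that month.
        wide[state] = wide[state].fillna(wide[present].mean(axis=1))
    return wide.stack().rename("fuel_cost_per_mmbtu").reset_index()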

zaneselvans avatar Mar 17 '22 22:03 zaneselvans

I looked into this today and found some things.

  • The API information we're using to fill in fuel prices right now is coarse in terms of fuel type. It only uses 3 high level categories, one each for all coal (COW), all petroleum derived fuels (PEL), and all natural gas (NG).
  • However, the FRC table has the much more detailed energy_source_code column that we could use for missing price estimation if we wanted to.
  • We map these three high level fuel categories to our own fuel_type_code_pudl (coal, oil, gas)... but we lump together different energy_source_code values in our categories than EIA does, which has to be introducing some misalignment between the filled in prices we're using, and what we'll get if we calculate monthly per-state, per-fuel costs.
  • Even so, looking just at these imperfectly mapped high level categories, and dropping all the NA before calculating the average fuel prices, the correlation is very high between the average coal/oil/gas price per unit as calculated from the FRC, and the value we get from the API.
  • I suspect we chose to use these high level categories because it was more likely to give us a value to fill in when it was redacted from the FRC table.
  • However we didn't understand the hierarchies of categories that the EIA was using at the time. Our categorization of coal, oil, and gas is more like solid, liquid, gas, while theirs is more indicative of the original resource the fuel was derived from (e.g. petcoke is solid so we called it coal, but it's derived from oil, so they call it petroleum).
  • Now that we have these different fuel attributes explicitly broken out in the energy_sources_eia coding table, we could choose to organize things differently if we wanted to, or potentially just use their categorizations instead of rolling our own.

API vs. frc_eia923

Here's how the fuel API prices compare to an aggregation from the reported data using the (slightly different) high level categories. Each point corresponds to a particular (report_date, state, fuel_type) combo.

Coal: [scatter plot image]

Oil: [scatter plot image]

Gas (note the different X and Y scales because of some outliers): [scatter plot image]

What to do?

  • Right now we could fill in missing values using the coal/oil/gas aggregated state prices. This would yield slightly different results than we're currently getting because those categories correspond to slightly different sets of records in the EIA and PUDL universes right now. But I don't think the different results would obviously be less correct than the current ones.
  • We could reform the fuel_type_code_pudl so that it does correspond to the EIA categorizations, and that ought to bring the two calculations closer together.
  • I think the "right" thing to do is probably to use a regression (random forest?) to try and estimate prices for each combination of (report_date, state, energy_source_code) rather than fuel_type_code_pudl, using whatever information is available in the FRC table. Or I guess we don't necessarily have to restrict ourselves to just looking at state-level estimates. We could try and estimate individual plant level prices and see how it does.

zaneselvans avatar Jun 14 '22 05:06 zaneselvans

I'm curious what @cmgosnell thinks about the simpler options above as an interim change.

zaneselvans avatar Jun 14 '22 13:06 zaneselvans

For context on the original choice to use the larger categories, iirc we went with fuel_type_code_pudl so that we'd get higher coverage. iirc the original way we were doing it (i.e. not what is currently in dev) had some (unnecessary) limiting factor, so we went with the broad fuel types to get broader coverage.

I agree generally that we can and probably should use the more granular energy_source_code. I assume that whatever we do won't align perfectly with EIA (in part because I could never find their methodology for how they generated these state-level values). I would personally not work with option 2 above (reform fuel_type_code_pudl). The desire imo is something defensible, not something that perfectly matches EIA.

cmgosnell avatar Jun 14 '22 13:06 cmgosnell

Do you think it makes sense to swap in our own average fuel prices based on the fuel_type_code_pudl aggregations for the time being, if it gives us coverage that's similar to what we're getting from the EIA API (which IIRC, wasn't great)?

Another thought: does it make more sense to try and estimate the price per unit or the price per MMBTU? (where unit is ton, barrel, or mcf). I would think that within these big categories there would be less dispersion on a $/MMBTU basis, since what you're really buying is heat content, not mass (i.e. BIT costs much more than LIG per ton, but they're closer to each other per MMBTU). I think we have the option of doing either one.

zaneselvans avatar Jun 14 '22 14:06 zaneselvans

I definitely think using fuel_type_code_pudl for the time being makes sense, but I assume it's similar math to try it with energy_source_code, so I'd personally check the coverage of the latter before defaulting to fuel_type_code_pudl.

I am less familiar with the unit vs btu distinction, but I think you are on to something! If all the info is there (which I believe it is in the frc table), then yeah go for it... but again I'd check the energy_source_code first. iirc the original coverage of the API fuel source was artificially limited because of something I did that was silly and was later fixed. My gut tells me that we'd get pretty good overall coverage using just energy_source_code with our own data.

I would almost use state-level energy_source_code prices and then try nearby states or just a national average.

cmgosnell avatar Jun 14 '22 14:06 cmgosnell

Okay, I'll try the simplistic (report_date, state, energy_source_code) and (report_date, state, fuel_type_code_pudl) options and see how the coverage compares to what we're currently doing.

Beyond that, rather than immediately jumping to our own bespoke estimator based on adjacent states, I'd like to try a few generic regression options from scikit-learn that can make use of a lot more features in the FRC table, and that can be trained and validated on the records in the table that do have fuel prices (which is most of them).

I guess we can also easily check whether the price per physical unit or the price per MMBTU is more consistent just by plotting the distributions of observed prices in the dataset per date / state / fuel. It seems like whichever variable has less dispersion would be the easier one to estimate accurately. But curious what @katie-lamb and @TrentonBush think with their stats background.
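A quick way to compare the two on dispersion, assuming frc is the fuel_receipts_costs_eia923 output table; fuel_cost_per_unit is an assumed column that may need to be derived from the cost per MMBTU and the heat content per unit:

import pandas as pd

def relative_iqr(s):
    """Scale-free dispersion: interquartile range divided by the median."""
    q1, med, q3 = s.quantile([0.25, 0.5, 0.75])
    return (q3 - q1) / med if med else float("nan")

groups = frc.groupby(["report_date", "state", "energy_source_code"])
dispersion = pd.DataFrame({
    "per_mmbtu": groups["fuel_cost_per_mmbtu"].apply(relative_iqr),
    # Hypothetical column; could be fuel_cost_per_mmbtu * heat content per unit.
    "per_unit": groups["fuel_cost_per_unit"].apply(relative_iqr),
})
print(dispersion.median())  # smaller typical dispersion = easier target to estimate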

zaneselvans avatar Jun 14 '22 14:06 zaneselvans

Maybe @priyald17 or @joshdr83 would have suggestions on how best to approach this too.

zaneselvans avatar Jun 14 '22 16:06 zaneselvans

For context here's the fuel_receipts_costs_eia923 table on Datasette and the field we're trying to estimate is fuel_cost_per_mmbtu. There are a bunch of additional attributes associated with both the plant_id_eia and the mine_id_pudl including the lat/lon of the plants and the county FIPS code associated with many of the coal mines, that could also be used to construct features.

zaneselvans avatar Jun 14 '22 16:06 zaneselvans

To clarify the problem:

  • we want to impute redacted fuel costs
  • we have previously filled them with one of the values from the EIA API, but that is a pain for various operational reasons
  • those EIA values are mysterious (to us) calculations that we think are something like a weighted average price that includes:
    • all the data we have
    • AND the redacted data we don't have

Is that right?

I see two separate parts to this:

  1. How well can we model known prices given all the other info we have in PUDL?
  2. Given the EIA API has information we don't have, how can we use that to validate our imputations of unknown prices?

Relatedly, the first few scoping questions I'd ask are:

  • how much variation is there in fuel price at a given point in time and/or region? (what are the stakes)
  • how much data is redacted? Are those particularly important for any reason? (what are the stakes)
  • do we have reason to think redacted costs are systematically different from known costs? (is this method even valid)

Ideally we test the validity, but remember we do have the option to just "assume and caveat" if the stakes are low. Plus, if EIA has done their job, there won't be a good way to test it.

With respect to modelling, I'd let the "what are the stakes" question guide the degree of complexity (simple averages vs fancier models). I'm not familiar with how redaction works (are the same plants always redacted? Does it come and go? etc), but in principle I think we have everything we need to construct a well formulated ML solution, if that is warranted.

TrentonBush avatar Jun 14 '22 22:06 TrentonBush

I wrote down these questions that are pretty similar to what Trenton asks above:

  • Are there particular states that are missing more data than other states or is it a somewhat random dispersion of missing data?
    • If Nevada was missing a lot of data then there might not be enough recorded Nevada data to provide an accurate estimate based on state average. It also might not necessarily be a good idea to use California as a proxy for fuel costs just because they are geographically close.
  • Similarly, are there particular time series that are missing data? Are we missing more data from some fuel types than others? Is there a reason for this?

If the redacted data is kind of "one off" and there is still enough representative data, then doing a regression to estimate prices for each date, state, and fuel type could definitely work. But I also agree that we have all the tools and data needed to do something more comprehensive if it seems that the missing data can't be represented that simplistically.

Although I got a little lost at some point about the construction of fuel_type_code_pudl vs. energy_source_code and price per MMBTU vs. price per unit, I think it makes sense to do some feature selection with the attributes of the fuel receipts cost table + energy_source_code, plus attributes from related tables, for each of the potential price variables we might try to estimate, instead of assuming that nearest states or something is the best estimator. The scores from this feature selection could also reveal which price variable is best to estimate. Maybe the k best features are similar for each estimated variable, or maybe there's one variable that stands out as more easily modeled.
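As a sketch of what that feature selection could look like; the column names here are illustrative guesses at FRC fields, not a definitive feature set:

import pandas as pd
from sklearn.feature_selection import mutual_info_regression

features = ["state", "energy_source_code", "contract_type_code", "sulfur_content_pct"]
target = "fuel_cost_per_mmbtu"

df = frc.dropna(subset=[target])
is_categorical = [df[col].dtype == "object" for col in features]
X = pd.DataFrame({
    # Mutual information needs numeric inputs, so integer-encode categoricals.
    col: df[col].astype("category").cat.codes if cat else df[col].fillna(-1)
    for col, cat in zip(features, is_categorical)
})

scores = pd.Series(
    mutual_info_regression(X, df[target], discrete_features=is_categorical),
    index=features,
).sort_values(ascending=False)
print(scores)  # higher score = feature carries more information about the price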

katie-lamb avatar Jun 14 '22 22:06 katie-lamb

  • The fuel_receipts_costs_eia923 table has about 560,000 records in it, and about 67% of them have reported fuel prices.
  • I'm not sure the monthly statewide average fuel costs do account for the redacted values, though that would be helpful.
  • There are cases in which no state-level average is available, and I'm not sure why.
  • Using the statewide averages and the coarse fuel types from the API, we get 91% coverage. Adding rolling averages only gets us up to 93%, so the time gaps are pretty big.

The distribution of missing values is skewed towards the early years (2008-2011) before filling with the API:

[histogram image]

After filling with the API it's pretty uniform across all the years:

[histogram image]

zaneselvans avatar Jun 14 '22 23:06 zaneselvans

What do you think would be a good way to evaluate the dispersion of fuel prices? It seems like there are at least 3 dimensions to consider in sampling and estimating fuel prices:

  • Spatial: multi-state regions, single states, or individual plants.
  • Temporal: the entire price history, a single year, a single month, or some seasonally correlated periods.
  • Fuel Type: coarse fuel groups (coal, oil, gas) or individual energy_source_code values.

Using the more granular options (plants, months, energy_source_code) seems like it should give us lower dispersion results, but it'll also reduce the sample sizes pretty dramatically, reducing confidence in the number we get out. For example, here are some state-level distributions for a single year, and a single energy source code. I chose a mix of big states (CA, FL, NY, TX) medium states (CO, WY) and small states (ID, ME). NG and SUB are the most common fuel types.

Price distribution by state for NG (2020)

[distribution plot image]

Price distribution by state for SUB (2020)

[distribution plot image]

zaneselvans avatar Jun 15 '22 15:06 zaneselvans

Rather than just counting individual fuel deliveries up, would it make more sense to weight them by the total heat content of the fuel delivered?

zaneselvans avatar Jun 15 '22 15:06 zaneselvans

The following is ONLY for gas prices (filtering not shown)

  • monthly median gets you within about 15% of the true value
  • a state-month median gets you to within about 10% (+- $0.30ish) of the true value

I'm sure we could tighten that much further with a better model. I'm just not sure what our goals should be here without an application in mind. Maybe a reasonable next step would be to throw a low-effort, high-power model like XGBoost at it and see what kind of accuracy is possible.

import numpy as np

# I didn't bother with any train-test split here, so these error values
# should be considered best-case indicators.
frc['global_median_fuel_price'] = frc['fuel_cost_per_mmbtu'].median()
frc['month_median_fuel_price'] = frc.groupby('report_date')['fuel_cost_per_mmbtu'].transform(np.median)
frc['state_month_median_fuel_price'] = frc.groupby(['state', 'report_date'])['fuel_cost_per_mmbtu'].transform(np.median)

error = frc.loc[:, [
    'global_median_fuel_price',
    'month_median_fuel_price',
    'state_month_median_fuel_price',
]].sub(frc['fuel_cost_per_mmbtu'], axis=0)

pct_error = error.div(frc['fuel_cost_per_mmbtu'], axis=0)

[error distribution histograms for the global, month, and state-month medians]

TrentonBush avatar Jun 15 '22 22:06 TrentonBush

To first address the goal of removing the EIA API calls from the CI asap:

Do you have an accuracy metric on how well the most granular option (plants, months, energy_source_code) performs against the API results? Or how swapping in state level instead of plant affects accuracy? I think for now you could just sub in the median of whatever combo of the three dimensions gets the highest API accuracy. It seems like it might be state, month, energy code. However, @zaneselvans mentioned there are states with no data (I think you actually said a state-level average is not available), so does this necessitate multi-state aggregations?

After this quick fix, I think looking at what other features we could bring into a more powerful model would be useful.

katie-lamb avatar Jun 15 '22 23:06 katie-lamb

Are the API results our target or are the redacted values our target? I thought the API was basically just an outsourced imputation model. If it includes redacted data in its aggregates then it has value for validation. Is there anything more to the EIA API?

TrentonBush avatar Jun 15 '22 23:06 TrentonBush

I believe you're right and the API is just modeling these values. After talking to Zane, I think to get a quick fix for today/tomorrow the API results are our target since people currently using the table don't want the values to change significantly. But to develop a more accurate and comprehensive solution in the less short term, the redacted values are indeed the target.

katie-lamb avatar Jun 15 '22 23:06 katie-lamb

Oh I didn't know there was a timeline on this at all, much less tomorrow! Do what you gotta do

TrentonBush avatar Jun 15 '22 23:06 TrentonBush

LOL ya I don't think this discussion is really framed around a timeline and is in fact more long term ideating of modeling tactics. My comment was really just spurred by annoyance that the CI is unreliable and wanting something to stand in there while a better solution is developed.

katie-lamb avatar Jun 15 '22 23:06 katie-lamb

Hey sorry, just tuning in! I did a spatial interpolation for fuel prices to average down to the county level for new build estimates in this paper. Is it mostly the non-ISO regions that are short of data?

joshdr83 avatar Jun 16 '22 13:06 joshdr83

  • I used the space, time, and fuel category aggregations I mentioned above, and calculated the median value for each of them for every row in the dataframe, and added those columns to the dataframe for easy comparison.
  • I calculated the relative error (reported price - estimate prices) / reported price for all of them, to get a sense of how far off they were from the reported values they were built out of.
  • Petroleum and natural gas based fuels vary a lot month to month, meaning aggregating across time introduces errors.
  • Petroleum fuel prices are highly correlated regionally and even nationwide, so aggregating them across space is okay.
  • Coal prices vary by state and region and type of coal, but not much over time, so aggregating up to annual prices isn't so bad if we have to.
  • I filled in the missing fuel prices iteratively, starting with the most specific estimates and working my way down to the least specific, sometimes only filling in for a given fuel group (annual prices for coal, regional prices for petroleum).
  • It also adds a filled_by column indicating what method was used to fill it in if it was missing.
  • I used national averages to fill in the last ~2% of the records that regional estimates still couldn't get.
  • Overall this process only takes about 5 seconds on my laptop and gets us a fuel price estimate for every record.

from collections import OrderedDict

import numpy as np
import pandas as pd
import scipy.stats

# Assumes pudl_out (a PUDL output object) and a logger are already configured.

cols = [
    "report_date",
    "plant_id_eia",
    "state",
    "energy_source_code",
    "fuel_cost_per_mmbtu",
    "fuel_group_code",
    "fuel_consumed_mmbtu", # required for weighting
]

CENSUS_REGIONS = {
    "ENC": ["WI", "MI", "IL", "IN", "OH"],
    "ESC": ["KY", "TN", "MS", "AL"],
    "MAT": ["NY", "PA", "NJ"],
    "MTN": ["MT", "ID", "WY", "NV", "UT", "CO", "AZ", "NM"],
    "NEW": ["ME", "VT", "NH", "MA", "CT", "RI"],
    "PCC": ["CA", "OR", "WA"],
    "PCN": ["AK", "HI"],
    "SAT": ["WV", "MD", "DE", "DC", "VA", "NC", "SC", "GA", "FL"],
    "WNC": ["ND", "SD", "MN", "IA", "NE", "KS", "MO"],
    "WSC": ["OK", "AR", "TX", "LA"],
}

CENSUS_REGION_MAPPING = {state: region for region, states in CENSUS_REGIONS.items() for state in states}

aggs = OrderedDict({
    # The most precise estimator we have right now
    "state_esc_month": {
        "agg_cols": ["state", "energy_source_code", "report_date"],
        "fuel_group_code": None,
    },
    # Good for coal, since price varies much more with location than time
    "state_esc_year": {
        "agg_cols": ["state", "energy_source_code", "report_year"],
        "fuel_group_code": "coal"
    },
    # Good for oil products, because prices are consistent geographically
    "region_esc_month": {
        "agg_cols": ["census_region", "energy_source_code", "report_date"],
        "fuel_group_code": "petroleum",
    },
    # Less fuel specificity, but still precise date and location
    "state_fgc_month": {
        "agg_cols": ["state", "fuel_group_code", "report_date"],
        "fuel_group_code": None,
    },
    # Less location and fuel specificity
    "region_fgc_month": {
        "agg_cols": ["census_region", "fuel_group_code", "report_date"],
        "fuel_group_code": None,
    },
    "region_fgc_year": {
        "agg_cols": ["census_region", "fuel_group_code", "report_year"],
        "fuel_group_code": None,
    },
    
    "national_esc_month":{
        "agg_cols":  ["energy_source_code", "report_date"],
        "fuel_group_code": None,
    },
    "national_fgc_month": {
        "agg_cols": ["fuel_group_code", "report_date"],
        "fuel_group_code": None,
    },
    "national_fgc_year": {
        "agg_cols": ["fuel_group_code", "report_year"],
        "fuel_group_code": None,
    },
})

frc = (
    pudl_out.frc_eia923()
    .loc[:, cols]
    .assign(
        report_year=lambda x: x.report_date.dt.year,
        census_region=lambda x: x.state.map(CENSUS_REGION_MAPPING),
        fuel_cost_per_mmbtu=lambda x: x.fuel_cost_per_mmbtu.replace(0.0, np.nan),
        filled=lambda x: x.fuel_cost_per_mmbtu,
    )
)

frc["filled_by"] = pd.NA
frc.loc[frc.fuel_cost_per_mmbtu.notna(), ["filled_by"]] = "original"

for agg in aggs:
    agg_cols = aggs[agg]["agg_cols"]
    fgp = aggs[agg]["fuel_group_code"]
    
    logger.info(f"Aggregating fuel prices by {agg}")
    frc[agg] = (
        frc.groupby(agg_cols)["fuel_cost_per_mmbtu"]
        .transform("median") # Use weighted median to avoid right-skew
    )

    logger.info(f"Calculating error for {agg}")
    frc[agg+"_err"] = (frc[agg] - frc.fuel_cost_per_mmbtu) / frc.fuel_cost_per_mmbtu
    
    logger.info(f"Filling fgp with {agg}")
    if fgp:
        logger.info(f"Filling only {fgp}")
    mask = (
        (frc.filled.isna())
        & (frc[agg].notna())
        & (frc.fuel_group_code == fgp if fgp else True)
    )
    logger.info(f"Mask selects {sum(mask)} values.")
    frc.loc[mask, "filled_by"] = agg
    frc.loc[mask, "filled"] = frc.loc[mask, agg]
frc.filled_by.value_counts()
original              373841
state_esc_month       127336
region_fgc_month       30420
national_esc_month     10091
region_esc_month        9823
state_fgc_month         6836
national_fgc_month      1079
region_fgc_year          702
state_esc_year           232
Name: filled_by, dtype: int64

Error distributions by fuel group

Generally these lined up with my intuition about which kinds of aggregation would introduce more error (as represented by the interquartile range of the error distribution), though the ordering differs somewhat between fuel types. We could just automatically apply all of them on a per-fuel basis, in order of increasing IQR, rather than hard-coding the order. The impact would be pretty marginal, but it would also do the right thing if future data looked different. We could also break out each year separately and apply that process, so that if the nature of the fuel price correlations changes over time, it would capture that and do the right thing each year (e.g. if lots of gas pipelines get built, maybe regional variation in gas prices would go away, and in those later years averaging across large areas would have much less impact than it does now). A sketch of that automation follows the plots below.

import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams["figure.figsize"] = (10, 9)
for fgc in ["coal", "petroleum", "natural_gas"]:
    fig, axes = plt.subplots(ncols=3, nrows=3, sharex=True, sharey=True)
    for agg, ax in zip(aggs, axes.flatten()):
        logger.info(f"Plotting {agg} error")
        error = frc.loc[frc.fuel_group_code==fgc, agg+"_err"]
        ax.hist(error[(error != 0.0)], range=(-1,1), bins=200)
        iqr = scipy.stats.iqr(error, nan_policy="omit")
        ax.set_title(f"{agg} (iqr={iqr:0.3})")
        ax.axvline(x=0, lw=1, ls="--", color="black")
    
    fig.suptitle(f"Fuel Group: {fgc}", fontsize=16)
    fig.tight_layout()
    plt.show();

[error distribution grids for coal, petroleum, and natural_gas]
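A sketch of the automatic per-fuel ordering mentioned above, reusing the frc and aggs objects from the code earlier in this comment:

import scipy.stats

# Rank aggregations by error-distribution IQR within each fuel group; filling
# in this order automates the intuitive priorities from the top of the issue.
agg_order = {}
for fgc in ["coal", "petroleum", "natural_gas"]:
    iqrs = {
        agg: scipy.stats.iqr(
            frc.loc[frc.fuel_group_code == fgc, agg + "_err"], nan_policy="omit"
        )
        for agg in aggs
    }
    agg_order[fgc] = sorted(iqrs, key=iqrs.get)  # least-dispersed estimate first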

Filled vs. Reported Distributions

To see how different the filled values are from the reported values, and whether there's an obvious bias being introduced, I made the plots below. It looks like maybe there's a sliiiiight rightward skew in the filled values for coal and natural gas, but not sure it's significant.

matplotlib.rcParams["figure.figsize"] = (10, 4)  # set the size before creating the figure
fig, axes = plt.subplots(ncols=3)  # sharex/sharey left off so each fuel keeps its own scale
fuel_group_codes = ["coal", "petroleum", "natural_gas"]
for fgc, ax in zip(fuel_group_codes, axes):
    ax.hist(
        frc.loc[frc.fuel_group_code == fgc, "fuel_cost_per_mmbtu"],
        label="reported",
        alpha=0.5,
        range=(0,40),
        bins=80,
        density=True,
    )
    ax.hist(
        frc.loc[(frc.fuel_group_code == fgc) & (frc.fuel_cost_per_mmbtu.isna()), "filled"],
        label="filled",
        alpha=0.5,
        range=(0,40),
        bins=40,
        density=True,
    )
    ax.legend(loc="upper right")
    ax.set_title(f"Fuel Group: {fgc}")
    ax.set_xlabel("Fuel Price [$/MMBTU]")
fig.tight_layout()
plt.show()

[reported vs. filled price histograms by fuel group]

zaneselvans avatar Jun 17 '22 01:06 zaneselvans

To compare the filled in values from the API vs. our aggregations for each fuel group and aggregation method, I made this plot.

import seaborn as sns

sns.relplot(
    data=frc[
        (frc.fuel_group_code.isin(["coal", "petroleum", "natural_gas"]))
        & (frc.fuel_cost_from_eiaapi)
        & (~frc.filled_by.isin(["state_esc_year", "region_fgc_year"]))
    ],
    x="fuel_cost_per_mmbtu",
    y="filled",
    row="fuel_group_code",
    col="filled_by",
    facet_kws=dict(xlim=(0,40), ylim=(0,40)),
    linewidth=0.0,
    alpha=0.1,
    s=5,
)

Here the x-axes are the API based fuel price, and the y-axes are the aggregation based fuel prices. Columns are the aggregation methods, and rows are the fuel groups. In the aggregation methods fgc = fuel group code, and esc = energy source code (the more specific fuels)

[facet grid: API price vs. aggregation price by fuel group and fill method]

For the petroleum and natural_gas fuel groups, there's a good 1:1 correlation between the API and aggregation based prices, but it's harder to see what's going on with coal because the points are all bunched together so I zoomed in on those prices only:

sns.relplot(
    data=frc[
        (frc.fuel_group_code.isin(["coal"]))
        & (frc.fuel_cost_from_eiaapi)
        & (~frc.filled_by.isin(["state_esc_year", "region_fgc_year"]))
    ],
    x="fuel_cost_per_mmbtu",
    y="filled",
    col="filled_by",
    col_wrap=3,
    facet_kws=dict(xlim=(0,8), ylim=(0,8)),
    linewidth=0.0,
    alpha=0.1,
    s=5,
)

For the state-level aggregations, there's some scatter, but generally a correlation between the API and aggregation based fuel prices. However, for the regional and national averages, it seems like there's a more substantial difference between the two methods. Some of the variability that's present in the API results is getting flattened by these larger spatial averages.

[facet grid: API vs. aggregation coal prices by fill method]

This isn't to say that the new method is necessarily worse, since the old fuel type categorizations were imperfect too, given the fuel grouping differences between PUDL and EIA. But it does suggest that there's some benefit to be had by refining the imputation further, especially since one of our main applications of this data is identifying unusually expensive coal plants for advocates. We also have the most additional information about the coal deliveries, since we usually know the mine name, what state it's in, maybe a unique ID for the mine, the supplier name, the ash and sulfur content of the coal, what kind of contract it was delivered on, etc.

Here are the proportions of coal records that are being filled in with the different aggregations:

frc[frc.fuel_group_code=="coal"].filled_by.value_counts() / sum(frc.fuel_group_code=="coal")
original              0.713344
state_esc_month       0.144066
region_fgc_month      0.067036
national_esc_month    0.042347
state_fgc_month       0.026690
national_fgc_month    0.004286
state_esc_year        0.001138
region_fgc_year       0.001093

About 11% of all coal records end up getting regional or national average prices assigned to them, and that's a little more than 1/3 of the total coal records which are being filled in.

I'm inclined to say that this is good enough for now, but that because these are probably some of the records we care most about, it will be worth coming back to this issue and trying to use a more generic model like XGBoost or a random forest that can utilize the other information we have.
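For reference, a baseline along those lines could be as small as the sketch below; the feature list is illustrative, and a real version would pull in mine, supplier, and contract attributes too:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

known = frc[frc.fuel_cost_per_mmbtu.notna()]
X = pd.DataFrame({
    # Integer-encode the categorical columns; keep report_year numeric.
    "state": known["state"].astype("category").cat.codes,
    "energy_source_code": known["energy_source_code"].astype("category").cat.codes,
    "fuel_group_code": known["fuel_group_code"].astype("category").cat.codes,
    "report_year": known["report_year"],
})
y = known["fuel_cost_per_mmbtu"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
print(f"held-out R^2: {model.score(X_test, y_test):.3f}")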

What do y'all think?

zaneselvans avatar Jun 17 '22 15:06 zaneselvans

I was ready to say "yes this is good enough", especially considering the results were so similar to our previous method of EIA API values. But now I'm confused!

Here the x-axes are the API based fuel price

I think I'm not understanding exactly what these plots are -- I thought your scatter plots near the top of this issue showed that API vs state-month average were extremely linear. Why are the state-fgc-month results so much different here? I'd be surprised if changing the agg func from mean to median made that much difference. If it did, that's interesting to know!

TrentonBush avatar Jun 17 '22 21:06 TrentonBush

@TrentonBush The earlier scatter plots are comparing all the reported data -- so the ones where there actually was data in the FRC table, and they're only being aggregated by [state, month, fuel_group]

The more recent scatter plots are only looking at data points that were not present in the FRC table, and comparing the values which were filled in by our new method (broken out by the kind of aggregation used) vs. the API values. So it's not surprising that the correlation is worse in general.

zaneselvans avatar Jun 17 '22 22:06 zaneselvans

Oh ok, now if I squint I can see how they are related -- basically remove the line and just look at the dispersion, plus add some x-coordinate noise to account for the switch from mean to median. Ya this model makes sense to me. I think using the median is justified given the handful of big outliers I ran into. We definitely want to use the mean for the purpose of disaggregating the API results in the future though.

one of our main applications of this data is to identify unusually expensive coal plants for advocates

That's an awesome use case I'd love to learn more about! There is certainly more accuracy to wring out of this data, and that kind of anomaly detection would probably require it. Maybe we find a day or two next sprint to try other models

TrentonBush avatar Jun 17 '22 22:06 TrentonBush

Yeah, there are some wild outliers. I want to figure out how to remove them and fill them in with reasonable values. Some of them are out of bounds by factors of like 1000x -- natural gas is particularly messy. The mean values were quite skewed.

Given that the values aren't really normally distributed, would it still make sense to replace anything more than 3-4$\sigma$ away from the mean? I could drop the top and bottom 0.1% of the distribution too, but there aren't always big outliers, and sometimes more than 0.1% of the points are bad. Not sure what the right thing to do is.
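One robust option (and what the task list at the top ended up recording) is a modified z-score based on the median and MAD, since those barely move when a few prices are off by 1000x. A plain-median sketch; the actual version would weight by MMBTU:

import pandas as pd

def modified_zscore(s):
    """Distance from the median in units of the median absolute deviation."""
    med = s.median()
    mad = (s - med).abs().median()
    # 0.6745 scales MAD to be comparable to one standard deviation for
    # normal data; groups with mad == 0 come out as inf/NaN.
    return 0.6745 * (s - med) / mad

score = frc.groupby(["state", "energy_source_code", "report_date"])[
    "fuel_cost_per_mmbtu"
].transform(modified_zscore)
outliers = frc[score.abs() > 3.5]  # 3.5 is a conventional cutoff (Iglewicz & Hoaglin)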

zaneselvans avatar Jun 18 '22 01:06 zaneselvans

I'm doing some additional normalization to the table and moving the fuel_group_code to the energy_sources_eia table, since the value is entirely determined by energy_source_code with this mapping:

assert (frc.groupby("energy_source_code").fuel_group_code.nunique() == 1).all()
dict(frc.groupby("energy_source_code").fuel_group_code.first())
{'BFG': 'other_gas',
 'BIT': 'coal',
 'DFO': 'petroleum',
 'JF': 'petroleum',
 'KER': 'petroleum',
 'LIG': 'coal',
 'NG': 'natural_gas',
 'OG': 'other_gas',
 'PC': 'petroleum_coke',
 'PG': 'other_gas',
 'RFO': 'petroleum',
 'SC': 'coal',
 'SGP': 'other_gas',
 'SUB': 'coal',
 'WC': 'coal',
 'WO': 'petroleum'}

I'm also renaming the column to energy_group_eiaepm since it's the fuel grouping that's used for reporting in the EIA Electric Power Monthly.

zaneselvans avatar Jun 18 '22 02:06 zaneselvans