
Retain all harvestable fields during EIA transforms

Open zaneselvans opened this issue 6 years ago • 7 comments

In many of our older EIA transformation functions, we preemptively drop columns from the tables that are being processed, in order to produce normalized tables. However, many of these columns contain information about the entities (plants, generators, utilities) that should be integrated into the entity harvesting and resolution process, which happens after the transform step.

Discarded Columns

  • Check whether the column name is defined in pudl.metadata.fields
  • If the name is not defined but the column does correspond to an existing field under a different name, change the name in the appropriate column_map.csv under src/pudl/package_data/{data source}/ so that it matches the DB schema.
  • If the column does not correspond to any existing defined field, it may be appropriate to discard it. E.g. total_fuel_consumption_mmbtu is an annual total of monthly values that are retained, and so we don't need it.
  • If the column corresponds to a defined field (either before or after the name has been fixed), retain it and debug any issues it causes later in the transform process. It should make it to the harvesting step, where it can inform plant/generator/boiler/utility attributes.
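
The checklist above can be sketched as a small triage function. This is only an illustration: `triage_columns`, `DEFINED_FIELDS`, and `RENAME_MAP` are hypothetical stand-ins for the names defined in pudl.metadata.fields and the renames in column_map.csv, not the actual PUDL API.

```python
def triage_columns(raw_columns, defined_fields, rename_map):
    """Sort raw column names into retain / rename / review buckets.

    All three arguments are hypothetical stand-ins: `defined_fields` plays the
    role of the names defined in pudl.metadata.fields, and `rename_map` plays
    the role of a fix that would be made in column_map.csv.
    """
    decisions = {"retain": [], "rename": {}, "review": []}
    for col in raw_columns:
        if col in defined_fields:
            # Already matches a defined field: keep it for harvesting.
            decisions["retain"].append(col)
        elif rename_map.get(col) in defined_fields:
            # Matches a defined field once renamed: fix the name, then keep it.
            decisions["rename"][col] = rename_map[col]
        else:
            # No defined field: a human decides whether discarding is appropriate.
            decisions["review"].append(col)
    return decisions


# Made-up inputs for illustration only.
DEFINED_FIELDS = {"plant_name_eia", "utility_id_eia", "utility_name_eia"}
RENAME_MAP = {"operator_id": "utility_id_eia", "operator_name": "utility_name_eia"}

result = triage_columns(
    ["plant_name_eia", "operator_id", "operator_name", "total_fuel_consumption_mmbtu"],
    DEFINED_FIELDS,
    RENAME_MAP,
)
print(result)
```

Columns that land in the "review" bucket are not dropped automatically, since the decision to discard (as with total_fuel_consumption_mmbtu) depends on whether the information is retained elsewhere.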

EIA-860

pudl.transform.eia860.ownership()

  • None

pudl.transform.eia860.generators()

  • None

pudl.transform.eia860.plants()

  • None

pudl.transform.eia860.utilities()

  • None

EIA-923

pudl.transform.eia923.plants()

  • None

pudl.transform.eia923.generation_fuel()

  • [ ] combined_heat_power
  • [ ] plant_name_eia
  • [ ] operator_name (probably utility_name_eia)
  • [ ] operator_id (probably utility_id_eia)
  • [ ] plant_state
  • [ ] census_region
  • [ ] nerc_region
  • [ ] naics_code
  • [ ] fuel_unit (should probably be dropped, since unit is implied by fuel type)
  • [ ] total_fuel_consumption_quantity (annual total?)
  • [ ] electric_fuel_consumption_quantity (annual total?)
  • [ ] total_fuel_consumption_mmbtu (annual total?)
  • [ ] elec_fuel_consumption_mmbtu (annual total?)
  • [ ] net_generation_megawatthours (annual total?)
  • [ ] early_release

pudl.transform.eia923.boiler_fuel()

This one may give you trouble. See #1847 and #1836.

  • [ ] combined_heat_power
  • [ ] plant_name_eia
  • [ ] operator_name (probably utility_name_eia)
  • [ ] operator_id (probably utility_id_eia)
  • [ ] plant_state
  • [ ] census_region
  • [ ] nerc_region
  • [ ] naics_code
  • [ ] fuel_unit (should probably be dropped, since unit is implied by fuel type)
  • [ ] total_fuel_consumption_quantity (annual total?)
  • [ ] balancing_authority_code_eia
  • [ ] early_release
  • [ ] reporting_frequency_code
  • [ ] data_maturity (we add this field during extraction, but it is getting dropped because of aggregations. See #1847)

pudl.transform.eia923.generation()

  • [ ] combined_heat_power
  • [ ] plant_name_eia
  • [ ] operator_name (probably utility_name_eia)
  • [ ] operator_id (probably utility_id_eia)
  • [ ] plant_state
  • [ ] census_region
  • [ ] nerc_region
  • [ ] naics_code
  • [ ] early_release

pudl.transform.eia923.coalmine()

  • None -- we really do just want the very small set of columns retained here, as we're stripping them out to create a new table, normalizing the Fuel Receipts & Costs table.

pudl.transform.eia923.fuel_receipts_costs()

  • [ ] plant_name_eia
  • [ ] plant_state
  • [ ] operator_name (probably utility_name_eia)
  • [ ] operator_id (probably utility_id_eia)
  • [ ] mine_id_msha (should be dropped)
  • [ ] mine_type_code (should be dropped)
  • [ ] state (of the mine?)
  • [ ] county_id_fips (of the mine?)
  • [ ] state_id_fips (of the mine?)
  • [ ] mine_name (should be dropped)
  • [ ] regulated (mine or plant?)
  • [ ] early_release

zaneselvans avatar Jan 18 '20 00:01 zaneselvans

@cmgosnell and I are going to help get @knordback working on this issue as a way to become more familiar with the harvesting process, working with our code, Jupyter, etc.

zaneselvans avatar Sep 02 '22 19:09 zaneselvans

@cmgosnell while talking over some of these fields with @knordback yesterday, I noticed that the associated_combined_heat_power field is part of the generators_entity_eia table, but there's another combined_heat_power field being reported in e.g. the generation_fuel_eia923 table, and looking at the spreadsheets, it seems like that field pertains to the plant (which makes some sense given that generation_fuel_eia923 is reported on a date, plant, prime-mover, fuel basis).

Are these different attributes? Should there be a CHP field at both the generator and the plant level? Should this really be a permanent attribute, or is it another one that changes slowly? Does the generator field really just indicate that the generator is part of a plant that does CHP? Or that it's part of a generation unit that does CHP? Could the plant or plant-prime-fuel level CHP status be inferred from the generator-level CHP attributes?

Right now we're discarding the CHP column reported in generation_fuel_eia923.

@grgmiller or @gschivley do either of you have more context on the relationship between these two different CHP fields?

zaneselvans avatar Feb 16 '23 17:02 zaneselvans

I don't know exactly. associated_combined_heat_power originates in the generator table. I would not be surprised if there were plants that had some units contributing to CHP and some that just generated power. I don't think it's generally a good idea to base any logic about the workings of a plant on the reporting structure of the generation_fuel_eia923 table. I personally would check whether this value is actually consistent across all generators within a plant before thinking about moving it. But I could also definitely imagine this changing over time (albeit very rarely!).
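
A minimal sketch of that within-plant consistency check, using plain Python records in place of the real generator table. The helper name `plants_with_mixed_values` and the sample rows are made up for illustration, and the plant_id_eia / generator_id key columns are assumed; with a dataframe, a groupby on the plant ID followed by a count of distinct values would do the same job.

```python
from collections import defaultdict


def plants_with_mixed_values(records, attribute):
    """Return {plant_id: distinct values} for plants whose generators disagree."""
    values_by_plant = defaultdict(set)
    for rec in records:
        values_by_plant[rec["plant_id_eia"]].add(rec[attribute])
    return {pid: vals for pid, vals in values_by_plant.items() if len(vals) > 1}


# Made-up generator records for illustration only.
generators = [
    {"plant_id_eia": 1, "generator_id": "A", "associated_combined_heat_power": True},
    {"plant_id_eia": 1, "generator_id": "B", "associated_combined_heat_power": False},
    {"plant_id_eia": 2, "generator_id": "A", "associated_combined_heat_power": True},
    {"plant_id_eia": 2, "generator_id": "B", "associated_combined_heat_power": True},
]

mixed = plants_with_mixed_values(generators, "associated_combined_heat_power")
print(mixed)
```

If the real data turned up many plants like plant 1 here, that would suggest the generator-level field cannot simply be moved to the plant level.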

cmgosnell avatar Feb 16 '23 19:02 cmgosnell

It seems like we should probably do an exhaustive check of all the currently "permanent" generator attributes on the pre-harvested dataframes... and see how permanent they actually are.
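
One way such a check might look, as a rough sketch: for each supposedly permanent attribute, count the share of generators that report more than one distinct value across years. The function `attribute_change_rates`, the key columns, and the sample rows are all hypothetical, and the real pre-harvest dataframes would need their own handling of missing values.

```python
from collections import defaultdict


def attribute_change_rates(records, id_cols, attributes):
    """Fraction of entities reporting more than one distinct value per attribute."""
    seen = {attr: defaultdict(set) for attr in attributes}
    for rec in records:
        key = tuple(rec[c] for c in id_cols)
        for attr in attributes:
            # Ignore missing values so gaps in reporting don't count as changes.
            if rec.get(attr) is not None:
                seen[attr][key].add(rec[attr])
    rates = {}
    for attr, by_entity in seen.items():
        changed = sum(1 for vals in by_entity.values() if len(vals) > 1)
        rates[attr] = changed / len(by_entity) if by_entity else 0.0
    return rates


# Made-up pre-harvest records: two generators reported in two years.
rows = [
    {"plant_id_eia": 1, "generator_id": "A", "report_year": 2019,
     "associated_combined_heat_power": True, "prime_mover_code": "ST"},
    {"plant_id_eia": 1, "generator_id": "A", "report_year": 2020,
     "associated_combined_heat_power": True, "prime_mover_code": "ST"},
    {"plant_id_eia": 1, "generator_id": "B", "report_year": 2019,
     "associated_combined_heat_power": False, "prime_mover_code": "GT"},
    {"plant_id_eia": 1, "generator_id": "B", "report_year": 2020,
     "associated_combined_heat_power": True, "prime_mover_code": "GT"},
]

rates = attribute_change_rates(
    rows,
    id_cols=("plant_id_eia", "generator_id"),
    attributes=("associated_combined_heat_power", "prime_mover_code"),
)
print(rates)
```

An attribute with a change rate near zero is plausibly permanent (any stray changes may be reporting errors), while a clearly nonzero rate suggests it belongs with the attributes that vary over time.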

zaneselvans avatar Feb 16 '23 19:02 zaneselvans

I do not have any context on these two fields.

grgmiller avatar Feb 16 '23 20:02 grgmiller

I'll hold off on this one for now.

knordback avatar Feb 16 '23 22:02 knordback

I think this is mostly done. Based on notes above I left in code dropping some of the fields in clean_generation_fuel_eia923() and clean_fuel_receipts_costs_eia923(), but I'm not certain I'm interpreting the notes correctly. There's also implicit dropping in plants_eia923(), and I don't know if that's as desired or not.

knordback avatar May 24 '23 15:05 knordback