pudl icon indicating copy to clipboard operation
pudl copied to clipboard

Transform EIA 860 environmental equipment summary and association data

Open e-belfer opened this issue 1 year ago • 15 comments

Short summary:

From 2014-2021, important summary information on each plant's associated control equipment is compiled in the Emissions Control Equipment table. From 2009-2021, the same information on IDs, cost, equipment type and operational status for SO2 and PM control is available in the table associated with each type of equipment in the EnvrEquipment dataset. More minimal information is also available on NOx and mercury controls. To effectively incorporate information on control equipment investments and operational status throughout time, we should pull the data from both the EnvrEquip tables (6.2 D-G) and the Emissions Control Equipment table, where available. When there is more than one source of information available for a given piece of control equipment, we should compare these values and harmonize them.

The output of this result will be an Emissions Control Equipment table for 2009-2021, with a unique ID for each piece of control equipment, containing the following information:

  • Control ID
  • Control type (boolean for particulate type)
  • Equipment type
  • Operating month and year of the control equipment
  • Operational status
  • Acid gas control (boolean)
  • Capital cost of equipment
### Tasks
- [ ] #2561 
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2586

Preliminary Task Detail

(more in individual PRs)

Metadata preparation

  • [x] Rename operating and retirement dates in pudl.package_data.eia860 to distinguish from generator operating/retirement dates

Emissions Control Equipment Table (2014-2021)

  • [x] Add all code columns to fields.py and define code metadata
  • [x] Convert inservice month and year columns to a date
  • [x] Minor QC on total cost column
  • [x] Normalize “Acid Gas Control?” column into boolean
  • [x] Define a unique PUDL ID for each piece of equipment, and make a shared control_id_pudl column that acts as a primary key for this dataset. Propagate this column to other incorporated datasets.

Association tables (2009-2021): Boiler Cooling, Boiler Particulate Matter, Boiler SO2, Boiler NOx, Boiler Mercury

  • [x] Add all code columns to fields.py and define code metadata
  • [x] Minor cleaning (e.g., make sure no plant or control IDs are missing)
  • [ ] Propagate unique PUDL ID from Emissions Control Equipment tab, or generate ID following same format for data prior to 2014 -- need to do as part of cooling and stack flue equipment tables

EnvrEquipment tables (FGD, FGP, Cooling, Stack Flue)

  • [ ] During transform of raw data, extract ID column, Inservice Month, Inservice Year, and Cost Total columns. For the FGP table, this is only applicable to 2009-2013 data.
  • [ ] Subject these columns to same transforms as above. Reuse functions.
  • [ ] Where available, compare these values to those reported in the Emissions Control Equipment table.
  • [ ] For each plant, if the same ID appears in multiple tables, compare cost, service date, and equipment type columns to identify whether they refer to the same piece of equipment (see Anticipated problems below)

Denormalized boiler-control equipment table

  • [ ] Join boiler IDs from association tables to the Emissions Control Equipment table, matching on plant and utility and control ID

Anticipated problems:

  • There are no NOx or mercury control IDs prior to 2014 in this dataset_. In the 2009-2013 data there is a boolean associated with mercury control, several columns containing operational status of NOx controls and column with information on control types in the Controls tab of the EnvrEquip data. These are reported at the boiler level, and are already integrated into the boiler attribute tables. As these are not associated with cost or installation dates, there doesn't seem to be much value into faking control IDs for this information, which may reflect 1 or more pieces of equipment. NOx control equipment IDs are reported prior to 2014 in the EIA 923 Schedule 8c data, and could also be incorporated into this analysis if it is of interest.
  • Equipment to boiler associations are not 1:1, but m:m. This shouldn’t be too much of an issue so long as we aren’t trying to bring this table into our existing entity structure, but should be kept in mind when performing joins.
  • Cost values are sometimes reported in thousands of dollars, and sometimes in dollars. Units should be harmonized.
  • Sometimes NOX ID 1 and SO2 ID 1 are the same piece of equipment in a plant, and sometimes not. This is spelled out explicitly in the EnvrEquip table, but not prior to 2014. We will need to compare additional columns to see whether these are the same or not when creating control equipment IDs. Simultaneously, this strengthens the argument for adding a unique ID column.
  • EnvrEquip and EnvrAssoc data for identical columns may differ for 2014-2021 data. We'll need to decide how to reconcile these differences.

Additional possible tasks:

  • [ ] In the boilers_annual_eia860 table, existing columns refer to strategies for emissions control, but are often empty. We could compare these to operational control equipment to generate more accurate records of what emission control strategies are in place at the boiler-level. Perhaps we should collapse the many strategy code columns in the boilers entity table into a few consolidated columns based on observed equipment types.
  • [ ] Attempt to backfill acid gas control boolean column
  • [ ] Incorporate cooling IDs, cost and installation dates in the same manner as above. Potentially, bring in the cooling system data reported in EIA923.
  • [ ] EIA923 Schedule 8C also reports FGD, FGP, Mercury and NOx control information from 2008-2021, including operational status. This does not include information on in-service date, but does include information on FGD operational and maintenance costs, among other information that may be of interest.

e-belfer avatar Feb 27 '23 22:02 e-belfer

We'd be very interested in seeing if there is a way to get access to this data or ease the process of, outside of PUDL, accessing this data. In particular we are interested in the following:

  1. The in-service date for emission control equipment
  2. The capital costs of emission control equipment

This data helps understand important details about recent investment in coal plants which helps with prospects for retirement or CCS.

I'm wondering if there is a way of bringing in this data without fully cleaning it. Things like merging together all the environmental equipment association tables and the basic fixing of EIA data weirdness in emissions_control_equipment.

It probably also makes sense to connect with @mariacastillo21 since she's worked with these tables and might have suggestions about how they could be processed.

arengel avatar Apr 25 '23 01:04 arengel

@arengel Together with @aesharpe and @cmgosnell I scoped out what I think is a very feasible path forward to bring this information into PUDL, and updated the issue above. Let me know if anything is unclear, or you have any questions.

Based on my work extracting the data, I think it's quite feasible to process information on in-service date, equipment costs, and associated boiler IDs into PUDL. Here are a few additional questions it would be helpful to have clarity on:

  • Do you only want data from 2013-present, or would you be interested in the full time-frame of data (2009-2021)? I've written the task list above to represent the workflow for backfilling this table using older data from the EnvrEquip column, at present.
  • Are there other columns relating to emissions control equipment that would be of interest to integrate at the same time?

e-belfer avatar Apr 27 '23 18:04 e-belfer

With the new ETL process in Dagster we can get a rough draft of these tables into the DB easily without fully normalizing and linking them with everything else pretty easily, along the lines of what's happening to the EIA-861 tables now.

For the capital costs of emissions control systems, I think they were only reported in these tables for a few years and are no longer available from EIA. Or if they are they've been moved elsewhere.

zaneselvans avatar Apr 27 '23 18:04 zaneselvans

@zaneselvans To my understanding, capital costs are currently reported in the Total Cost (Thousands of Dollars) column of the Emissions Control Equipment table, which has the following definition: "The nominal installed cost for the existing emissions control equipment or the anticipated cost to bring the planned piece of emissions control equipment into commercial operation"

@arengel If this is not the column you're referring to, this would of course be helpful to know!

e-belfer avatar Apr 27 '23 18:04 e-belfer

Hmm. It looks like there's also a bunch of detailed emissions equipment O&M and capital cost data in the EIA-923 Schedule 8. Maybe There was more detailed cost reporting for a couple of years in the EIA-860 and it got incorporated into EIA-923? I'm not sure where I was getting this idea from.

Oh, it looks like it was probably the flue gas particulate emissions control costs, which show up in the EIA-860 through 2012, but not in later years.

zaneselvans avatar Apr 27 '23 20:04 zaneselvans

There is definitely more detailed cost information in Schedule 8. The FGP control costs from 2014-present are now reported in the Emissions Control Equipment tab (which has one row for each piece of control equipment), as is noted above! There's some tab and ID renaming that happens between 2013-2014.

e-belfer avatar Apr 27 '23 20:04 e-belfer

Thanks @e-belfer!

Total Cost (Thousands of Dollars) on the Emissions Control Equipment tab of 6_1_EnviroAssoc_YXXX.xlsx is indeed the one I was referring to, I don't remember how long its been there but it is definitely there in 2021.

In terms of what I think would be useful I think we'd want a couple things:

  • All the columns from the Emissions Control Equipment tab. The total cost and inservice date columns are the most important but we'd also want retirement date, status, and the others.
  • I think the hard part is what to do with the all the emission control equipment IDs and providing a way to link them to boilers using the association tables without turning this into a grand many:many problem in which you have to aggregate or allocate costs. My proposal here would be 2 tables, at least as an interim step:
    1. one that is basically just the cleaned Emissions Control Equipment tab.
    2. one that combines all the boiler <> equipment ID association tabs, this table will capture all those many:many associations. Hopefully this could just be an association table or crosswalk that would not change between years.
  • This allows you to make the data accessible without having to figure out a general approach to integrating it all together. For a lot of uses I think the integration can be a simple aggregation, but I'd argue against doing it in PUDL because for other use cases you are looking at a small number of plants and want as granular data as is available. Also, after using this data some, we might come up with better ideas of how to do a general integration.
  • As for other column / tabs, there is valuable information in the Cooling, FGP, FGD, and Stack Flue tabs of 6_2_EnviroEquip_YXXX.xlsx but I'd put it as a lower priority. In those cases, if it isn't too big a pain to do a simple cleaning of those tables, I wouldn't want to stand in your way, but I wouldn't go much beyond basic cleaning (e.g. I wouldn't worry about reconciling costs between them and Emissions Control Equipment). Users could then use the equipment association table mentioned in (ii) above to link this data to boilders and then generators.

Before saying anymore, I want to acknowledge that I am not that familiar with issues around boiler reporting and how they get associated with generators. But, it sort of seems like, at least as regards Emissions Control Equipment, sorting out the issues with differently reported past years may not be as important or even required because the table in the most recent year is complete. And if not I'd be fine with only going back to 2013 or whenever the data becomes more or less consistent.

arengel avatar Apr 27 '23 20:04 arengel

@arengel, @e-belfer correct me if I'm wrong here, but I think that the main difference between getting all years vs. just 2014+ is where the columns come from. If we only want 2014+ we can take the table from the Emissions Control Equipment tab as is, and if we want older years, we'll combine columns from various tabs in the the EnviroEquip spreadsheet. They're all reported at the same granularity, so there's no need for any aggregation or complex merging. The only difference is that the earlier years of data may have less mercury and NOx data.

Let me know if there are any preliminary checks you'd like us to run to see if the older years would satisfy your needs. Otherwise we can just go ahead and integrate 2014+

aesharpe avatar Apr 27 '23 21:04 aesharpe

I think the complex merging is with connecting emission control equipment to boilers, that's what I was proposing to avoid figuring out for now.

On going back further, it is lower priority on our side, so I'd propose sticking with the tables that are easy-ish to deal with for now.

Also, it's not obvious to me, at least for the Emissions Control Equipment tab that you need all years, it looks like the most recent year might be complete. Or it might only be complete for currently operating boilers or plants.

arengel avatar Apr 27 '23 22:04 arengel

I wouldn't be surprised if the emissions control equipment associations aren't entirely constant with time. We've been surprised in the past at just how many of the generator attributes do slowly evolve, for example the changes in the prime mover or technology type. But hopefully those instances are rare, since big reconfigurations of how the units in a plant connect to each other are also rare.

zaneselvans avatar Apr 28 '23 01:04 zaneselvans

I agree the 2021 data should provide a relatively complete source of information, particularly thanks to the addition of the retirement date columns in more recent years of EIA data. But, in addition to the question about whether only operating boilers are covered, I think there's a question about the completeness of data for equipment retirements. We do see retirement dates going back to 1991 in the 2021 table. The survey itself, however, only asks for all equipment that was in operation in the past year, or that is expected to be in operation in the near future. It's possible that the 2021 table is complete, or that there would be some equipment retirements missed because of the framing of the survey itself. I think the workload involved in integrating 2014-2021 data won't be substantially more involved than just pulling the 2021 data, and if looking at changes in equipment over time is important this might help flesh out some more information here. Doing a quick comparison of a few years of data to the 2021 table would be a fast gut check on whether this is actually substantiated, of course. @mariacastillo21 I'd also be curious if you've seen this issue come up in your work with these tables.

Regarding the boiler associations, the proposal of making a mega-association table sounds like a great interim solution for sidestepping the m:m complications.

e-belfer avatar Apr 28 '23 14:04 e-belfer

@mariacastillo21 or @arengel do you know what's up with the rows in the Emissions Control Equipment table that have all NA values for XX_control_id and acid_gas_control? Those rows still have a emission_control_equipment_type listed, but nothing else. Do you know what it means for a plant to have emissions equipment but no control ids or acid gas control? It only occurs for emission_control_equipment_type values 'SR', 'OT', 'ACI', 'LNB'. Only the LNB technology is always NA, the others sometimes have control ids or acid gas control values reported.

ANSWER: almost all of these have an operational_status_code other than OP which makes sense that they would not yet or no longer have an ID attached to them. There are only 7 of these rows that do say OP. Three of them are the LNB technology which is never associated with control ids or acid gas control. Two of them are close to the operating year and seem to get updated in later years to have control ids, and the last two (from the same plant: 4941) are only reported in 2016 and 2017 and then seem to go away.

aesharpe avatar Apr 28 '23 21:04 aesharpe

Also @mariacastillo21 I'm assuming that control ids reported in the same row are part of the same equipment given the shared emission_control_equipment_cost , installation_year, and emission_control_equipment_type columns. Does that seem fair? Not all of controls ids are the same in a given row (though many of them are), and I'm assuming this difference is not important.

I don't plan on cleaning the control_id columns unless you think it's worth exploring. I do plan on doing some checks to see if the control ids are consistent over time.

aesharpe avatar Apr 28 '23 22:04 aesharpe