Add epacems crosswalk to etl

Open aesharpe opened this issue 2 years ago • 20 comments

Context

This Draft PR is about making sure the new EPA CEMS - EIA Crosswalk file (now in the datastore not stored in the repo) gets successfully integrated into the ETL.

I thought a lot about whether it should flow through the ETL like the rest of the data in the Datastore (i.e. have it's own extract, transform, and load modules) or get addressed separately in the glue module (as it is currently).

Because the crosswalk doesn't provide any "new" data, I decided to keep it lumped together with the rest of the glue. From a coding perspective, I think it's complicated when we treat data that comes from the same place (i.e. the Datastore) differently (i.e. some go through the regular ETL process and some are etl-ed in glue modules like eia_epacems.py). This is ok but it was confusing to figure out what was going on when I was attempting to build the code infrastructure. Having a clear protocol for all data would be best imo, otherwise a very well documented depiction of when/why data gets treated one way vs. another. This might not be the time or place to address this, however.

What this PR does:

This PR changes the eia_epacems.py glue file:

Changes the name to be more specific (and all references to the old name) --> epacems_unitid_eia_plant_crosswalk.py
Updates the functions inside the old epacems_unitid_eia_plant_crosswalk.py module to grab the raw crosswalk from the datastore rather than a csv in the repo.
Updates the transform function.
Adds DOIs so the Datastore script can the crosswalk data from Zenodo

Jun 15 '22 22:06 aesharpe

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Jun 15 '22 22:06 review-notebook-app[bot]

Hey @aesharpe it looks like you may have deleted the old eia_epacems.py file and created an entirely new epacems_unitid_eia_plant_crosswalk.py file, rather than using git to rename the file, which results in the version history being broken, and it being impossible to see what all changed in the file (since git thinks the entire file is new, rather than knowing that the two files are the same entity, and comparing the differences between them).

Jun 17 '22 17:06 zaneselvans

Also it looks like the changes broke the ETL.

Jun 17 '22 17:06 zaneselvans

you may have deleted the old eia_epacems.py file and created an entirely new epacems_unitid_eia_plant_crosswalk.py file, rather than using git to rename the file, which results in the version history being broken

TIL! I have done this a few times in client repos

Jun 17 '22 18:06 TrentonBush

This text is from my last commit, but it's very relevant!

The epacems crosswalk file does not have a date field. It does, however, use foreign keys from eia tables that are restricted by date when the ETL is run for a subset of years. Without this addition, the integration tests will fail because the tests, based on the fast etl (aka one year of data), formulate entity tables (used to map foreign keys onto the crosswalk) that contain less data than the crosswalk (based on all years). One solution is to restrict the crosswalk data to only show the plant and generator ids from the selected subset of eia data. That's what I've implemented here.

This isn't ideal because there are some cases in which the crosswalk has erroneous generator_id data. I fixed this in the previous commit (before selecting only the plant-gen records that show up in the eia subset). If I hadn't checked these generator records they would have been excluded from the crosswalk for not matching. By only including the values that are inherently in the eia subset we guarantee to only get the data pertaining to the years we want BUT we run the risk of excluding data that was erroneously reported (and could be fixed) without warning.

How do folks feel about this?

Jun 21 '22 18:06 aesharpe

Still to do: Test crosswalk unit_id_epa values against actual CEMS data

Update: Done!

Jun 21 '22 21:06 aesharpe

Latest changes:

These changes are based on the conversation in #1758

First, I changed the EPA-EIA crosswalk from three tables to one. The previous tables figured that there was a 1:1 relationship between plant_id_epa and plant_id_eia. All the fields are interconnected, and it makes most sense for them to be in one table. I also added in the boiler_id and generator_id fields for mapping more granular EIA data.

Next, I fixed some of the column names in CEMS. What we previously called plant_id_eia was a close but not perfect approximation of ORISPL code that the crosswalk intended to fix. I changed it to plant_id_epa so as not to confuse it with the official plant_id_eia values. There were also two unit columns in the CEMS data, one called unitid and one called unit_id_epa. unitid referred to the smokestack units and unit_id_epa referred to...well...I still don't know. So I dropped unit_id_epa and renamed unitid to unit_id_epa. unit_id_epa was also erroneously defined as the smokestack unit in our documentation so the name change meant that I could leave that description as is. Lastly, I removed the facility_id column because it didn't seem to refer to anything useful.

After editing the column names, I updated the metadata, ETL functions, and tests to reflect the changes.

The CEMS transform module was relying on the old plant_id_eia column to correct timezones in the data. While all the values were in the right timezones, it still wasn't technically the right ID, so I pulled the crosswalk into the CEMS transform module to replace the old plant_id_eia column (now plant_id_epa) with the correct plant_id_eia value. I did this by filling in the long empty harmonize_eia_epa_orispl() function. Ultimately, I dropped the plant_id_epa column from CEMS in favor of the more accurate plant_id_eia once merged with the crosswalk. I also removed the add_facility_id_unit_id_epa() function because I removed those columns from the CEMS data.

TODO:

[x] decide on official name for the crosswalk

Jul 18 '22 23:07 aesharpe

As far as the naming of the crosswalk goes, what's the reporting / data umbrella that all that information is coming from, if not the EPA CEMS? Should it be epaapmd_eia or it looks like maybe they've rebranded AMPD to CAMPD so maybe epacampd_eia

Jul 19 '22 16:07 zaneselvans

As far as the naming of the crosswalk goes, what's the reporting / data umbrella that all that information is coming from, if not the EPA CEMS? Should it be epaapmd_eia or it looks like maybe they've rebranded AMPD to CAMPD so maybe epacampd_eia

I made a little discussion post (https://github.com/catalyst-cooperative/pudl/discussions/1754) about it idk if you want to chat about it there or here? I think something like epacamd_eia_crosswalk is most accurate

Jul 19 '22 16:07 aesharpe

EPA suggests citing the crosswalk as:

Huetteman, J., Tafoya, J., Johnson, T., and Schreifels, J. (2021). EPA-EIA Power Sector Data Crosswalk. Accessible at www.epa.gov/airmarkets/power-sector-data-crosswalk.

maybe something like power_sector_data_crosswalk or epa_eia_power_sector_data_crosswalk

Jul 19 '22 16:07 grgmiller

maybe something like power_sector_data_crosswalk or epa_eia_power_sector_data_crosswalk

I'm torn between using similar language to the EPA and keeping it simple. Given that all our data pertains to the power sector my feeling is that we can probably leave that part out?

Jul 19 '22 17:07 aesharpe

Maybe just epa_eia_data_crosswalk or epa_eia_crosswalk then?

Jul 19 '22 17:07 grgmiller

I think my favorite thus far is epacamd_eia_crosswalk, but maybe just epa makes more sense for column suffixes. I also don't know the extent to which we would ever integrate other non CAMD EPA data....?

Jul 19 '22 17:07 aesharpe

I don't think there's any reason to have _data_ in the name... of course it's all data.

There are so many different EPA datasets talking about related facilities and issues, I think just saying epa_eia_crosswalk is a bit vague. What's the most specific EPA program acronym that is still broadly applicable to the information contained in the crosswalk? You're saying it's wider than the CEMS, but what all sits between EPA (as a whole agency) and CEMS (a too-narrow program)? It looks like the two main options are:

CAMD: the Clean Air Markets Division (an organizational unit within EPA)
CAMPD: Clean Air Markets Program Data (a particular collection of data published by CAMD, which includes emissions allowances, emissions themselves, and compliance reporting)

Given our pattern of naming with the FERC and EIA data, where we concatenate the agency with the identifier of the data source (e.g. eia+923 or ferc+eqr) the most analogous option here to me seems like it might be epacampd_eia_crosswalk. Though I've tried to standardize on putting these identifiers in alphabetical order so you don't have to guess which one comes first if you don't know of the top of your head, so maybe eia_epacampd_crosswalk instead...

Jul 19 '22 18:07 zaneselvans

I agree that just _epa is fine for the suffixes.

Jul 19 '22 18:07 zaneselvans

Okay I see that the EPA GitHub repo uses camd_eia_crosswalk and you've already switched everything to say epacamd_eia_crosswalk and I think that's fine too.

Jul 19 '22 19:07 zaneselvans

What is going on with the RTD build failure? It's timing out on some Zenodo data retrieval, but I didn't realize it even needed Zenodo data. I guess for the data source documentation?

Jul 20 '22 03:07 zaneselvans

FYI: Looks like EPA is getting ready to do a new release of the crosswalk soon, so if there are any updates or fixes that you'd like to have integrated, now might be the time to reach out: https://github.com/USEPA/camd-eia-crosswalk/pull/25#issuecomment-1190243145

Jul 20 '22 15:07 grgmiller

Hi @aesharpe , while you're in here updating the epacems column names, I just wanted to flag this issue for you: https://github.com/catalyst-cooperative/pudl/issues/1581

Jul 21 '22 18:07 grgmiller

Also ongoing convo about next crosswalk release! https://github.com/USEPA/camd-eia-crosswalk/issues/26

Jul 21 '22 19:07 aesharpe

Just wanted to flag that this PR will close #604

Aug 24 '22 22:08 grgmiller

One overarching comment/question... are we switching from using cems to camd? I thought that camd was the overarching data program of the clean air market program and cems is the specific dataset? like the nesting goes EPA > CAMD > CEMS. I'm not sure how the crosswalk fits in there... like does the crosswalk only pertain to CEMS or other CAMD datasets? Regardless... to me it looks like you changed some names from cems to camd and left others, which feels confusing to me.

Yes, it goes EPA > CAMD > CEMS and technically the crosswalk applies to all CAMD data. It's confusing because CEMS is the only CAMD data we currently use. I decided to keep the label as CAMD rather than CEMS because a) we might want to integrate other non-cems CAMD data in the future and b) EPA calls it the CAMD crosswalk and I wanted to stay consistent with that.

Aug 29 '22 22:08 aesharpe

Codecov Report

Base: 83.2% // Head: 83.1% // Decreases project coverage by -0.0% :warning:

Coverage data is based on head (80e560b) compared to base (48cd28c). Patch coverage: 83.3% of modified lines in pull request are covered.

Additional details and impacted files

@@           Coverage Diff           @@
##             dev   #1692     +/-   ##
=======================================
- Coverage   83.2%   83.1%   -0.1%     
=======================================
  Files         65      64      -1     
  Lines       7398    7345     -53     
=======================================
- Hits        6158    6110     -48     
+ Misses      1240    1235      -5

Impacted Files	Coverage Δ
src/pudl/extract/epacems.py	`97.3% <ø> (ø)`
src/pudl/metadata/classes.py	`82.1% <ø> (ø)`
src/pudl/metadata/fields.py	`100.0% <ø> (ø)`
src/pudl/metadata/resources/epacems.py	`100.0% <ø> (ø)`
src/pudl/metadata/resources/glue.py	`100.0% <ø> (ø)`
src/pudl/metadata/resources/pudl.py	`100.0% <ø> (ø)`
src/pudl/metadata/sources.py	`100.0% <ø> (ø)`
src/pudl/output/pudltabl.py	`88.2% <25.0%> (-0.8%)`	:arrow_down:
src/pudl/output/epacems.py	`77.5% <28.5%> (-8.7%)`	:arrow_down:
src/pudl/extract/eia861.py	`93.5% <66.6%> (-3.2%)`	:arrow_down:
... and 7 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

Sep 05 '22 20:09 codecov[bot]

Updating the the module name in the scraper and archiver repos: https://github.com/catalyst-cooperative/pudl-scrapers/pull/47 https://github.com/catalyst-cooperative/pudl-zenodo-storage/pull/26

Sep 08 '22 18:09 aesharpe

I updated the DOIs in the datastore to pull the newly renamed archives, and (I think) updated the last of the old names in the code. I've also uploaded the new archives to gs://zenodo-cache.catalyst.coop so the CI has easy access to them. Assuming the CI passes here, and it's attempting to load the crosswalk, then it seems like it could be merged. All the changes from dev have been merged into this branch.

Sep 09 '22 01:09 zaneselvans

Hey @aesharpe one thing I guess this does still need is an update to the docs/release_notes.rst summarizing the changes.

I merged some small PRs adding data sources (the EIA-861 2021ER and EIA bulk electricity data) into dev over the weekend so we can know if they cause any issues with the nightly builds on Monday. I merged dev into this PR branch after that. Only one line of conflict which seemed pretty clear -- the renaming of the leading zeroes fixer in the EIA-861 extractor.

Sep 10 '22 22:09 zaneselvans

Only one line of conflict which seemed pretty clear -- the renaming of the leading zeroes fixer in the EIA-861 extractor.

How should we resolve this?

Sep 12 '22 16:09 aesharpe

Oh I resolved it by using the new name. It's all good.

Sep 13 '22 17:09 zaneselvans

This is an important after-the-fact addition: https://github.com/catalyst-cooperative/pudl/pull/1938

Sep 16 '22 22:09 aesharpe

pudl pudl copied to clipboard

Add epacems crosswalk to etl

Context

What this PR does:

Codecov Report

pudl
pudl copied to clipboard