pudl
pudl copied to clipboard
Add epacems crosswalk to etl
Context
This Draft PR is about making sure the new EPA CEMS - EIA Crosswalk file (now in the datastore not stored in the repo) gets successfully integrated into the ETL.
I thought a lot about whether it should flow through the ETL like the rest of the data in the Datastore (i.e. have it's own extract, transform, and load modules) or get addressed separately in the glue
module (as it is currently).
Because the crosswalk doesn't provide any "new" data, I decided to keep it lumped together with the rest of the glue. From a coding perspective, I think it's complicated when we treat data that comes from the same place (i.e. the Datastore) differently (i.e. some go through the regular ETL process and some are etl-ed in glue modules like eia_epacems.py). This is ok but it was confusing to figure out what was going on when I was attempting to build the code infrastructure. Having a clear protocol for all data would be best imo, otherwise a very well documented depiction of when/why data gets treated one way vs. another. This might not be the time or place to address this, however.
What this PR does:
This PR changes the eia_epacems.py
glue file:
- Changes the name to be more specific (and all references to the old name) -->
epacems_unitid_eia_plant_crosswalk.py
- Updates the functions inside the old
epacems_unitid_eia_plant_crosswalk.py
module to grab the raw crosswalk from the datastore rather than a csv in the repo. - Updates the transform function.
- Adds DOIs so the Datastore script can the crosswalk data from Zenodo
Check out this pull request on
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
Hey @aesharpe it looks like you may have deleted the old eia_epacems.py
file and created an entirely new epacems_unitid_eia_plant_crosswalk.py
file, rather than using git to rename the file, which results in the version history being broken, and it being impossible to see what all changed in the file (since git thinks the entire file is new, rather than knowing that the two files are the same entity, and comparing the differences between them).
Also it looks like the changes broke the ETL.
you may have deleted the old
eia_epacems.py
file and created an entirely newepacems_unitid_eia_plant_crosswalk.py
file, rather than using git to rename the file, which results in the version history being broken
TIL! I have done this a few times in client repos
This text is from my last commit, but it's very relevant!
The epacems crosswalk file does not have a date field. It does, however, use foreign keys from eia tables that are restricted by date when the ETL is run for a subset of years. Without this addition, the integration tests will fail because the tests, based on the fast etl (aka one year of data), formulate entity tables (used to map foreign keys onto the crosswalk) that contain less data than the crosswalk (based on all years). One solution is to restrict the crosswalk data to only show the plant and generator ids from the selected subset of eia data. That's what I've implemented here.
This isn't ideal because there are some cases in which the crosswalk has erroneous generator_id data. I fixed this in the previous commit (before selecting only the plant-gen records that show up in the eia subset). If I hadn't checked these generator records they would have been excluded from the crosswalk for not matching. By only including the values that are inherently in the eia subset we guarantee to only get the data pertaining to the years we want BUT we run the risk of excluding data that was erroneously reported (and could be fixed) without warning.
How do folks feel about this?
Still to do: Test crosswalk unit_id_epa
values against actual CEMS data
Update: Done!
Latest changes:
These changes are based on the conversation in #1758
First, I changed the EPA-EIA crosswalk from three tables to one. The previous tables figured that there was a 1:1 relationship between plant_id_epa
and plant_id_eia
. All the fields are interconnected, and it makes most sense for them to be in one table. I also added in the boiler_id
and generator_id
fields for mapping more granular EIA data.
Next, I fixed some of the column names in CEMS. What we previously called plant_id_eia
was a close but not perfect approximation of ORISPL code that the crosswalk intended to fix. I changed it to plant_id_epa
so as not to confuse it with the official plant_id_eia
values. There were also two unit columns in the CEMS data, one called unitid
and one called unit_id_epa
. unitid
referred to the smokestack units and unit_id_epa
referred to...well...I still don't know. So I dropped unit_id_epa
and renamed unitid
to unit_id_epa
. unit_id_epa
was also erroneously defined as the smokestack unit in our documentation so the name change meant that I could leave that description as is. Lastly, I removed the facility_id
column because it didn't seem to refer to anything useful.
After editing the column names, I updated the metadata, ETL functions, and tests to reflect the changes.
The CEMS transform module was relying on the old plant_id_eia
column to correct timezones in the data. While all the values were in the right timezones, it still wasn't technically the right ID, so I pulled the crosswalk into the CEMS transform module to replace the old plant_id_eia
column (now plant_id_epa
) with the correct plant_id_eia
value. I did this by filling in the long empty harmonize_eia_epa_orispl()
function. Ultimately, I dropped the plant_id_epa
column from CEMS in favor of the more accurate plant_id_eia
once merged with the crosswalk. I also removed the add_facility_id_unit_id_epa()
function because I removed those columns from the CEMS data.
TODO:
- [x] decide on official name for the crosswalk
As far as the naming of the crosswalk goes, what's the reporting / data umbrella that all that information is coming from, if not the EPA CEMS? Should it be epaapmd_eia
or it looks like maybe they've rebranded AMPD to CAMPD so maybe epacampd_eia
As far as the naming of the crosswalk goes, what's the reporting / data umbrella that all that information is coming from, if not the EPA CEMS? Should it be
epaapmd_eia
or it looks like maybe they've rebranded AMPD to CAMPD so maybeepacampd_eia
I made a little discussion post (https://github.com/catalyst-cooperative/pudl/discussions/1754) about it idk if you want to chat about it there or here? I think something like epacamd_eia_crosswalk
is most accurate
EPA suggests citing the crosswalk as:
Huetteman, J., Tafoya, J., Johnson, T., and Schreifels, J. (2021). EPA-EIA Power Sector Data Crosswalk. Accessible at www.epa.gov/airmarkets/power-sector-data-crosswalk.
maybe something like power_sector_data_crosswalk
or epa_eia_power_sector_data_crosswalk
maybe something like
power_sector_data_crosswalk
orepa_eia_power_sector_data_crosswalk
I'm torn between using similar language to the EPA and keeping it simple. Given that all our data pertains to the power sector my feeling is that we can probably leave that part out?
Maybe just epa_eia_data_crosswalk
or epa_eia_crosswalk
then?
I think my favorite thus far is epacamd_eia_crosswalk
, but maybe just epa
makes more sense for column suffixes. I also don't know the extent to which we would ever integrate other non CAMD EPA data....?
I don't think there's any reason to have _data_
in the name... of course it's all data.
There are so many different EPA datasets talking about related facilities and issues, I think just saying epa_eia_crosswalk
is a bit vague. What's the most specific EPA program acronym that is still broadly applicable to the information contained in the crosswalk? You're saying it's wider than the CEMS, but what all sits between EPA (as a whole agency) and CEMS (a too-narrow program)? It looks like the two main options are:
- CAMD: the Clean Air Markets Division (an organizational unit within EPA)
- CAMPD: Clean Air Markets Program Data (a particular collection of data published by CAMD, which includes emissions allowances, emissions themselves, and compliance reporting)
Given our pattern of naming with the FERC and EIA data, where we concatenate the agency with the identifier of the data source (e.g. eia
+923
or ferc
+eqr
) the most analogous option here to me seems like it might be epacampd_eia_crosswalk
. Though I've tried to standardize on putting these identifiers in alphabetical order so you don't have to guess which one comes first if you don't know of the top of your head, so maybe eia_epacampd_crosswalk
instead...
I agree that just _epa
is fine for the suffixes.
Okay I see that the EPA GitHub repo uses camd_eia_crosswalk
and you've already switched everything to say epacamd_eia_crosswalk
and I think that's fine too.
What is going on with the RTD build failure? It's timing out on some Zenodo data retrieval, but I didn't realize it even needed Zenodo data. I guess for the data source documentation?
FYI: Looks like EPA is getting ready to do a new release of the crosswalk soon, so if there are any updates or fixes that you'd like to have integrated, now might be the time to reach out: https://github.com/USEPA/camd-eia-crosswalk/pull/25#issuecomment-1190243145
Hi @aesharpe , while you're in here updating the epacems column names, I just wanted to flag this issue for you: https://github.com/catalyst-cooperative/pudl/issues/1581
Also ongoing convo about next crosswalk release! https://github.com/USEPA/camd-eia-crosswalk/issues/26
Just wanted to flag that this PR will close #604
One overarching comment/question... are we switching from using
cems
tocamd
? I thought thatcamd
was the overarching data program of the clean air market program andcems
is the specific dataset? like the nesting goes EPA > CAMD > CEMS. I'm not sure how the crosswalk fits in there... like does the crosswalk only pertain to CEMS or other CAMD datasets? Regardless... to me it looks like you changed some names fromcems
tocamd
and left others, which feels confusing to me.
Yes, it goes EPA > CAMD > CEMS and technically the crosswalk applies to all CAMD data. It's confusing because CEMS is the only CAMD data we currently use. I decided to keep the label as CAMD rather than CEMS because a) we might want to integrate other non-cems CAMD data in the future and b) EPA calls it the CAMD crosswalk and I wanted to stay consistent with that.
Codecov Report
Base: 83.2% // Head: 83.1% // Decreases project coverage by -0.0%
:warning:
Coverage data is based on head (
80e560b
) compared to base (48cd28c
). Patch coverage: 83.3% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## dev #1692 +/- ##
=======================================
- Coverage 83.2% 83.1% -0.1%
=======================================
Files 65 64 -1
Lines 7398 7345 -53
=======================================
- Hits 6158 6110 -48
+ Misses 1240 1235 -5
Impacted Files | Coverage Δ | |
---|---|---|
src/pudl/extract/epacems.py | 97.3% <ø> (ø) |
|
src/pudl/metadata/classes.py | 82.1% <ø> (ø) |
|
src/pudl/metadata/fields.py | 100.0% <ø> (ø) |
|
src/pudl/metadata/resources/epacems.py | 100.0% <ø> (ø) |
|
src/pudl/metadata/resources/glue.py | 100.0% <ø> (ø) |
|
src/pudl/metadata/resources/pudl.py | 100.0% <ø> (ø) |
|
src/pudl/metadata/sources.py | 100.0% <ø> (ø) |
|
src/pudl/output/pudltabl.py | 88.2% <25.0%> (-0.8%) |
:arrow_down: |
src/pudl/output/epacems.py | 77.5% <28.5%> (-8.7%) |
:arrow_down: |
src/pudl/extract/eia861.py | 93.5% <66.6%> (-3.2%) |
:arrow_down: |
... and 7 more |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
Updating the the module name in the scraper and archiver repos: https://github.com/catalyst-cooperative/pudl-scrapers/pull/47 https://github.com/catalyst-cooperative/pudl-zenodo-storage/pull/26
I updated the DOIs in the datastore to pull the newly renamed archives, and (I think) updated the last of the old names in the code. I've also uploaded the new archives to gs://zenodo-cache.catalyst.coop so the CI has easy access to them. Assuming the CI passes here, and it's attempting to load the crosswalk, then it seems like it could be merged. All the changes from dev
have been merged into this branch.
Hey @aesharpe one thing I guess this does still need is an update to the docs/release_notes.rst
summarizing the changes.
I merged some small PRs adding data sources (the EIA-861 2021ER and EIA bulk electricity data) into dev
over the weekend so we can know if they cause any issues with the nightly builds on Monday. I merged dev
into this PR branch after that. Only one line of conflict which seemed pretty clear -- the renaming of the leading zeroes fixer in the EIA-861 extractor.
Only one line of conflict which seemed pretty clear -- the renaming of the leading zeroes fixer in the EIA-861 extractor.
How should we resolve this?
Oh I resolved it by using the new name. It's all good.
This is an important after-the-fact addition: https://github.com/catalyst-cooperative/pudl/pull/1938