pudl
pudl copied to clipboard
Transform EIA860 and EIA923 Cooling System Tables
Overview
Closes #3393 and #3395.
Testing
How did you make sure this worked? How can a reviewer verify this?
# To-do list
- [ ] Ensure docs build, unit & integration tests, and test coverage pass locally with `make pytest-coverage` (otherwise the merge queue may reject your PR)
- [ ] For significant ETL changes, ensure the full ETL runs locally
- [ ] For major data coverage & analysis changes, [run data validation tests](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/testing.html#data-validation)
- [ ] If updating analyses or data processing functions: make sure to update or write data validation tests
- [ ] Update the [release notes](../docs/release_notes.rst): reference the PR and related issues.
- [ ] Review the PR yourself and call out any questions or issues you have
I think it probably makes sense to keep these tables separate post harvesting just because there is so much information that would be duplicated from the annual 860 table to the monthly 923 table.
I think it probably makes sense to keep these tables separate post harvesting just because there is so much information that would be duplicated from the annual 860 table to the monthly 923 table.
I'm confused about this - to me the goal of harvesting is to combine duplicate info reported at different frequencies and reconcile it. Do you mean that the fields are overlapping, or what is meant here?
TODO:
- [ ] Still have the EIA-860 fields to do, but the EIA-923 ones appear to be behaving themselves.
- [ ] pull the slowly-varying-fields check into a
pudl.validate
utility method - [ ] daydream about revamping our homegrown fields setup
@jdangerx In addition to resolving this "where should our fields live" question, can you add the dagster instance migrate
instructions (for minor version updates) somewhere to our docs? Otherwise I think this PR should be ready to merge today.
OK, so: If i put the fields into the FIELDS_METADATA
then I get a “fields not being used” error in test, because we don’t have these assets defined as resources. My question is, can we define these asset schemas in resources
? They won't get persisted in SQLite, right, since we don't have the PUDL SQLite IO manager assigned to these assets. But we would need to make sure they don't get added to the data dictionary until they're persisted. If we can't define these schemas in resources then we need to come up with some other way to define the data types, which... ugh.
By actually adding these schemas to our RESOURCE_METADATA & writing them out to DB, I hooked them up to a lot of data quality checks I hadn't before. Some changes:
- the cooling system types are actually not reported as codes, they're reported as human-facing that are roughly equivalent to the code descriptions (but not completely!)
- the EIA cooling system ID can be NA in rare cases - fill those to the string "NA" so that we can still use it as a primary key.
- there is one record that's duplicated in EIA923 - deduplicate that
- fields that EIA says are "whole fahrenheit degrees" are actually floats 🙃
There still seems to be a doc build issue, which I should look into.
After much debugging of the _core_eia860__cooling_equipment
↔️ core_eia860__scd_plants
foreign key which seemed to be fine, I realized the actual FK error in the tests was with the EIA923 table instead, and I had run into https://github.com/catalyst-cooperative/pudl/issues/1196. After fixing the foreign key, my local checks all worked and I hope this is reflected in the GHA run!