pudl icon indicating copy to clipboard operation
pudl copied to clipboard

Transform EIA860 and EIA923 Cooling System Tables

Open aesharpe opened this issue 1 year ago • 2 comments

Overview

Closes #3393 and #3395.

Testing

How did you make sure this worked? How can a reviewer verify this?

# To-do list
- [ ] Ensure docs build, unit & integration tests, and test coverage pass locally with `make pytest-coverage` (otherwise the merge queue may reject your PR)
- [ ] For significant ETL changes, ensure the full ETL runs locally
- [ ] For major data coverage & analysis changes, [run data validation tests](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/testing.html#data-validation)
- [ ] If updating analyses or data processing functions: make sure to update or write data validation tests
- [ ] Update the [release notes](../docs/release_notes.rst): reference the PR and related issues.
- [ ] Review the PR yourself and call out any questions or issues you have

aesharpe avatar Feb 16 '24 18:02 aesharpe

I think it probably makes sense to keep these tables separate post harvesting just because there is so much information that would be duplicated from the annual 860 table to the monthly 923 table.

aesharpe avatar Feb 20 '24 23:02 aesharpe

I think it probably makes sense to keep these tables separate post harvesting just because there is so much information that would be duplicated from the annual 860 table to the monthly 923 table.

I'm confused about this - to me the goal of harvesting is to combine duplicate info reported at different frequencies and reconcile it. Do you mean that the fields are overlapping, or what is meant here?

e-belfer avatar Feb 21 '24 15:02 e-belfer

TODO:

  • [ ] Still have the EIA-860 fields to do, but the EIA-923 ones appear to be behaving themselves.
  • [ ] pull the slowly-varying-fields check into a pudl.validate utility method
  • [ ] daydream about revamping our homegrown fields setup

jdangerx avatar Feb 29 '24 23:02 jdangerx

@jdangerx In addition to resolving this "where should our fields live" question, can you add the dagster instance migrate instructions (for minor version updates) somewhere to our docs? Otherwise I think this PR should be ready to merge today.

e-belfer avatar Mar 11 '24 13:03 e-belfer

OK, so: If i put the fields into the FIELDS_METADATA then I get a “fields not being used” error in test, because we don’t have these assets defined as resources. My question is, can we define these asset schemas in resources? They won't get persisted in SQLite, right, since we don't have the PUDL SQLite IO manager assigned to these assets. But we would need to make sure they don't get added to the data dictionary until they're persisted. If we can't define these schemas in resources then we need to come up with some other way to define the data types, which... ugh.

jdangerx avatar Mar 11 '24 20:03 jdangerx

By actually adding these schemas to our RESOURCE_METADATA & writing them out to DB, I hooked them up to a lot of data quality checks I hadn't before. Some changes:

  • the cooling system types are actually not reported as codes, they're reported as human-facing that are roughly equivalent to the code descriptions (but not completely!)
  • the EIA cooling system ID can be NA in rare cases - fill those to the string "NA" so that we can still use it as a primary key.
  • there is one record that's duplicated in EIA923 - deduplicate that
  • fields that EIA says are "whole fahrenheit degrees" are actually floats 🙃

There still seems to be a doc build issue, which I should look into.

jdangerx avatar Mar 12 '24 16:03 jdangerx

After much debugging of the _core_eia860__cooling_equipment ↔️ core_eia860__scd_plants foreign key which seemed to be fine, I realized the actual FK error in the tests was with the EIA923 table instead, and I had run into https://github.com/catalyst-cooperative/pudl/issues/1196. After fixing the foreign key, my local checks all worked and I hope this is reflected in the GHA run!

jdangerx avatar Mar 14 '24 13:03 jdangerx