pudl
pudl copied to clipboard
generators_entity_eia operating_date field blank when op dates disagree
Describe the bug
I created a PUDL sqlite database with EIA860 and 923 data from 2011-2017. At least one plant (55177) has a NULL operating_date
value in the for all generators in generators_entity_eia
table. All years show that generators came online in 2001 but the files split (3 to 3) about if it was in March or June.
Bug Severity
How badly is this bug affecting you?
- Medium: With some effort, I can work around the bug.
I use operating_date
to determine a retirement year for power plants. Null values mean that plants don't have a retirement date, so I need to go back and add them to my settings file by hand.
To Reproduce
Build the datapackages with 860 data from 2011-2017.
Expected behavior
I'm not quite sure what the expected behavior should be. In this specific case the year of operation is consistent across time despite the changing month. Maybe use the most recent value instead of NULL?
Right now, this is the expected behavior -- when any of the supposedly constant fields in 860 is too inconsistent, it gets marked NA/Null, since we want to have a single value associated with these long-lived entities (plants, generators, etc.) but it's not clear what that value should be, in the general case. Is there any reason to think that the most recent value is more likely to be correct than the older values?
In this particular case, it seems like as long as the dates are close to each other they should be considered consistent (like we do with lat/lon, where they're rounded to the nearest 0.01 degrees or something). This would happen in the harvesting functions that @cmgosnell set up.
I agree that it’s difficult to generalize rules for these disagreements. The fact that it changes and is then consistent makes me trust the new values a bit more.
I’ll take a look to see which other plants have values that change over time.
I just wanted to re-flag this issue. I'm noticing missing operating dates for a specific plant (plant 136, generator 2). In the 2022 EIA-860 data, this has an operating date of December 1984, but as far as I can tell this date is missing from pudl. It looks like back in 2005 this generator had an operating date of January 1985, and in more recent versions of 860, it has a reported operating date of December 1984.
Regardless of the ability to create a single static generator_operating_date, it would be nice if there was a way to access any data that exists in the original raw EIA spreadsheets through PUDL, rather than that data being deleted. (Maybe this does exist in one of the output tables and I just haven't been able to find it yet).
I feel like this is a pretty big issue because this means that I have to go back to relying on directly reading raw excel files downloaded from the EIA website to get data I need in these cases.
@grgmiller what version of the data are you working with? I think we may have fixed this in #3340 / #3419.
We don't write the "raw" versions of the spreadsheets into the database because there's no real schema / consistent datatypes for them at that point. They just exist as dataframes.
We could potentially write the pre-harvesting _core_eia*
assets to the DB which would help with this kind of debugging without anybody needing to go back to the spreadsheets, but we don't really want people depending on those tables since they contain a ton of duplicate and internally inconsistent data.
Locally, I'm working with v2023.12.01 (which I know is outdated), but I also checked on datasette, which I assume is the most recent version of the data. Looking at the two tables mentioned in https://github.com/catalyst-cooperative/pudl/issues/3340, it looks like both the generator_operating_date
and current_planned_generator_operating_date
columns for plant 136 generator 2 are blank for all records. I also looked in core_eia__entities_generators
and core_eia860__scd_generators
and couldn't find a non-missing operating date for this generator. Is there a different table that I should be looking in?
I guess maybe the raw/pre-harvest assets may not be appropriate to write to the database, but I feel like it could make sense if:
- If there is a "entity" table that is meant to represent static attributes, and that static attribute is not consistent over time, maybe it's appropriate to treat that as an NA (although maybe an alternative approach could be to use the most recently provided or most frequently provided value)
- If there is a table that has a "report_date" column, then any column in that table should change over time and represent what was reported in the raw data for that report date, even if not consistent over time.
^ Maybe this is already the design intent, and the issue with plant 136 generator 2 is just a bug rather than an intentional choice to drop the operating date.
I'm also noticing that this seems to be affecting several generators with operating dates that are in the early 1900s - not sure if this is also a result of inconsistent operating dates being reported, or if it has to do with the age of the generator, but for example there is no operating date data for plant 62 generators 5 and 6 (built 1915); plant 63 generators 2 and 3 (built 1914), and plant 294 generator 2 (built 1918)
I think it's mostly already set up the way you're describing.
- The
core_eia__entities_*
tables contain the (few) truly static attributes. - The
core_eia__scd_*
tables contain the "slowly changing dimensions" -- values that are reported annually and often don't change from year to year but which can change (i.e. most attributes)
When multiple conflicting values of a given attribute are reported, there's a process for selecting which one gets kept, if any. The default process is "Pick the most frequently reported value, as long as it accounts for more than 70% of the non-null values that were observed for this attribute", and if no value meets this criteria, it is set to NA.
For some fields, we have other methods of deciding which value to keep, the simplest being setting a lower minimum consistency (less than 70%).
We can't straightforwardly report the raw value for any value in a table with a report_date
column, because within a single year the same attribute is often reported in many different tables, with potentially conflicting values.
In the case of generator_operating_date
that really is a static value -- the generator went into service on some past date and it shouldn't ever change. So that field comes from the core_eia__entity_generators
table. Any value of that field which you see in any other core_
or out_
table is ultimately derived from the value you find in core_eia__entity_generators
which is the single source of truth for that value after we've normalized the data. So if it's NA in the entity table, it'll be NA everywhere else downstream.
And if it's NA in the entity table, that means that the operating date was reported inconsistently, so we don't know what it really was. We could switch to taking the first reported value, or the last reported value, or the average of all observed values, or we could give the values found in one spreadsheet tab that we consider to be the authoritative home for the attribute a higher weight, etc. But we don't have a clean abstraction for choosing among a menu of different selection processes for different attributes right now.
I'll take a look at the _core_eia860__generators
asset that I have cached locally and see what's going on with the value for that plant specifically.
As expected, there are 2 operating dates getting reported, with roughly equal frequency. But it seems like #3419 should have addressed this case. @e-belfer do you know why this case would have slipped through that fix?
report_date | plant_id_eia | generator_id | generator_operating_date | |
---|---|---|---|---|
14384 | 2001-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
31220 | 2002-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
48252 | 2003-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
65521 | 2004-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
83213 | 2005-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
101258 | 2006-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
119481 | 2007-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
138555 | 2008-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
13569 | 2009-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
18174 | 2010-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
67756 | 2012-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
86481 | 2013-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
105634 | 2014-01-01 00:00:00 | 136 | 2 | 1985-01-01 00:00:00 |
112856 | 2015-01-01 00:00:00 | 136 | 2 | 1984-12-01 00:00:00 |
132914 | 2016-01-01 00:00:00 | 136 | 2 | 1984-12-01 00:00:00 |
153637 | 2017-01-01 00:00:00 | 136 | 2 | 1984-12-01 00:00:00 |
175075 | 2018-01-01 00:00:00 | 136 | 2 | 1984-12-01 00:00:00 |
197204 | 2019-01-01 00:00:00 | 136 | 2 | 1984-12-01 00:00:00 |
219934 | 2020-01-01 00:00:00 | 136 | 2 | 1984-12-01 00:00:00 |
243350 | 2021-01-01 00:00:00 | 136 | 2 | 1984-12-01 00:00:00 |
267995 | 2022-01-01 00:00:00 | 136 | 2 | 1984-12-01 00:00:00 |
293379 | 2023-01-01 00:00:00 | 136 | 2 | 1984-12-01 00:00:00 |
Thanks for the explanation - that makes sense. My guess about why https://github.com/catalyst-cooperative/pudl/pull/3419 didn't work is that even though these two dates are within a month of each other, they are in different years. To resolve this, you might need to look within a 12-month window, rather than within the same year.
Ah okay, I didn't realize it was based on calendar year and not a rolling window. Maybe this is an easy improvement to make.
That's exactly what's happening. When implementing, I considered this and didn't wind up choosing the window option, partially because it's slightly more challenging to implement but primarily because it raises the question of how to collapse multiple dates within twelve month windows of one another.
An example (with faked data) to hopefully clarify:
plant_id_eia | generator_id | report_date | generator_operating_date |
---|---|---|---|
1336 | S2 | 2010-01-01 | 1973-04-01 |
1336 | S2 | 2011-01-01 | 1973-06-01 |
1336 | S2 | 2012-01-01 | 1974-05-01 |
1336 | S2 | 2013-01-01 | 1975-01-01 |
1336 | S2 | 2014-01-01 | 1977-01-01 |
E.g., for a plant and generator like this, where the dates for all but the last row are all within 12 months of one another but the cumulative time difference between them is more than 12 months, do we want to treat them all as having a date of 1973-04-01
? Programatically it's not impossible but it becomes more complex to explain what's going on under the hood. If we want to revisit this decision we should look to #3340.
The bigger question Greg raises is whether to change our dropping behavior or our harvesting behavior for these columns - even if these dates disagreed by more than a year, perhaps there's still value in keeping a generator_operating_date
and flagging that it isn't harvested. That seems like a bigger design question.
@e-belfer Rather than using the calendar year, would it be easy / make sense to groupby the generator ID, calculate the difference between the min and max values of operating date, and if it's less than 1-year take the most recently reported value? How many of the stragglers would be left? It sounded like there were some genuinely nonsensical cases where the various operating dates are separated by decades too.
Adding my 2 cents here. Some capacity expansion models retire a plant's capacity it reaches an "end of lifetime" age. For models that do this, it's more important to have an approximately correct operating date than choose between a correct or missing date.
@zaneselvans I see that this was at least partially addressed in the v2024.5.0 release. I'm going to take a look but do you know how much generator_operating_date
coverage has improved?
I think there are fewer generators with NULL operating dates in the newest version but it's still a non-trivial number.
Is it possible to report all reported operating dates rather than NULL? I know it's potentially messy but NULL is unhelpful (unless no date has ever been reported). I end up merging it in from the 861 excel file. It would be great to have everything within PUDL instead.
To include ALL of the dates that were ever reported, structurally we would have to move the dates from the entity table to the slowly changing dimension table, which could be done, but also semantically it seems... wrong? Like generators really should have one operating and one retirement date. IIRC when I investigated the distribution of ambiguous dates, picking the last (or first, etc) date within a rolling 2-year window rather than using calendar year filled in nearly all of the missing (because ambiguous) values. And then there were just a handful with truly wild discrepancies of up to multiple decades.
For the small number of generators with ambiguous / multiply reported values, how would you go about choosing one as the canonical value?
Is there a place in the pipeline where I can see the discrepancies? Could the large discrepancies have anything to do with diesel gensets that get moved from one location to another?
From the user perspective, it feels like out_eia__yearly_generators
should include the information from 860 in each reporting year. I understand that it's a mix of static attributes, calculated data, etc. But it would make sense to see the reported operating date in each row.
Is there a strict need to choose the canonical value? Maybe it's a matter of taking a normative vs empirical perspective on the data. There should be a single date that every generator came online but in practice all we have is the reported values.
@gschivley The notebook I used to originally view and debug these differences is linked in this comment: https://github.com/catalyst-cooperative/pudl/issues/3340#issuecomment-1936140757. I can pickle a dataframe of the most dramatic differences and share it here tomorrow morning.
FYI - I'm picking back up on this issue since it's been a consistent thorn for multiple people, so expect further updates from me in the coming weeks.
Certainly agree about the 365 day window instead of the calendar year, that seems like an obvious improvement. That still leaves some edge cases (<10 generators).
Here are the edge cases I'm seeing, where the dates disagree by more than a year. Not many in number, and a few are just over the year window, but there are some much larger discrepancies here. Also downloadable here as a pickle.
plant_id_eia | generator_id | generator_operating_date | report_date | operational_status |
---|---|---|---|---|
1391 | 4A | 1982-09-01 | 2001-01-01 | existing |
1391 | 4A | 1987-08-01 | 2011-01-01 | existing |
1403 | 6(4) | 1971-05-01 | 2011-01-01 | existing |
1403 | 6(4) | 1992-03-01 | 2001-01-01 | existing |
1404 | 7A | 1973-04-01 | 2011-01-01 | existing |
1404 | 7A | 1974-09-01 | 2001-01-01 | existing |
1404 | 7B | 1973-04-01 | 2011-01-01 | existing |
1404 | 7B | 1974-09-01 | 2001-01-01 | existing |
2115 | IC6 | 1963-03-01 | 2010-01-01 | existing |
2115 | IC6 | 1965-09-01 | 2015-01-01 | existing |
2869 | 1A | 1973-06-01 | 2014-01-01 | existing |
2869 | 1A | 1983-06-01 | 2001-01-01 | existing |
3604 | 6A | 1957-06-01 | 2013-01-01 | existing |
3604 | 6A | 1997-06-01 | 2001-01-01 | existing |
58265 | S-13 | 2004-06-01 | 2016-01-01 | existing |
58265 | S-13 | 2006-06-01 | 2012-01-01 | existing |
58265 | S-17 | 2004-06-01 | 2012-01-01 | existing |
58265 | S-17 | 2006-11-01 | 2016-01-01 | existing |
A couple of those 10+ year discrepancies look like maybe they could be typos or OCR errors, with just a single digit difference in the date. 1957 vs 1997? 1973 vs 1983?
Decision moving forward is to prioritize filling these nulls:
- bump the window to 2 years
- manually research the remaining 5 generators
- add an assertion error for any new ones that pop up
If we can't resolve the generator operating dates through manual research, we can resolve these 5 by taking, say the most recent date with >40% consistency.
@gschivley Let me know how this sounds to you or if you have any feedback.
If we focus on the larger discrepancies, here's what I've come up with so far:
plant_id_eia | recent report is older/newer | notes |
---|---|---|
1391 | Newer (1982 -> 1985) | No idea |
1403 | Older (1992 -> 1971) | Maybe it was originally generator_id 4 then repowered in 1992 and renamed 6(4) |
2869 | Older (1983 -> 1973) | I think this was a data error. Unit 1B has always been listed as operating in 1973. I found a plant contact and emailed them. The out_ferc1__yearly_steam_plants_sched402 also shows a plant construction/operation year of 1973. |
3604 | Older (1997 -> 1957) | There was a repowering in 1997 |
It looks like all the major changes happen between 2011-2014. Maybe EIA put out guidance to report original online date rather than repower/modification date? Ironically, if the newer operating date is a repower it is probably a better value for my users.
Just heard back from West Lorain (2869). Both 1A and 1B came online in 1973, no idea why it was reported wrong at first. Maybe OCR like @zaneselvans suggested. Or a typo.
If we can't resolve the generator operating dates through manual research, we can resolve these 5 by taking, say the most recent date with >40% consistency.
@e-belfer are you saying take the most recent date so long as it accounts for >40% of reported values? Or the most recently reported date? I'd lean towards saying recently reported dates are more likely to be "correct" (although the 1992 repowering of a turbine originally installed in 1971 is information I wish EIA was reporting).
@gschivley Thanks for your investigative work here, it's been very helpful!
It seems like the manual cases we have confirmation on, we wind up just wanting to take the most recently reported date. So rather than ending up with 10 different rules, let's simplify here and just take the last reported generator operating date when no consistent value can be found.
@gschivley & @grgmiller this change got merged into main
last night and the nightly builds passed, so you should be able to grab a new copy of the DB and find entirely non-null generator operating dates.