why is basically nothing in the rate base table from `core_ferc1__yearly_depreciation_by_function_sched219` in 2021?
See this comment that identified the problem. Track it down & fix it; presumably it's coming from the original table transform.
@cmgosnell:
Out of 205 unique entities in the raw XBRL instant data, only 4 utilities report data in the leafy "accumulated provision for depreciation of X" columns in 2021. The rest report data in what is ultimately the "total" plant function of the "accumulated depreciation" factoid. For some reason, the corrections from subtotal to total in the `plant_function` sub-dimension are not getting propagated to the rate base table.
To see the values reported in this table:
```python
out_ferc1__yearly_rate_base[
    (out_ferc1__yearly_rate_base.table_name == "core_ferc1__yearly_depreciation_by_function_sched219")
    & (out_ferc1__yearly_rate_base.report_year == 2021)
    & (out_ferc1__yearly_rate_base.ending_balance.notnull())
]
```
These are the same utilities observed in the transformed table:
```python
from dagster import AssetKey

from pudl.etl import defs

asset_key = "core_ferc1__yearly_depreciation_by_function_sched219"
df = defs.load_asset_value(AssetKey(asset_key))
df[
    (df.report_year == 2021)
    & (df.plant_function != "total")
    & (df.ending_balance.notnull())
]
```
There are 0 correction records in the rate base table for this table:
```python
df[df.report_year == 2021].row_type_xbrl.value_counts()
```
| row_type_xbrl | count |
| --- | --- |
| reported_value | 4540 |
| calculated_value | 0 |
| correction | 0 |
That suggests the issue isn't pruning, but is actually happening in the generation of the correction records in the first place. Which makes sense, because we currently aren't adding sub-total to total corrections, so we are effectively losing all "total" records in this table.
@zaneselvans I think I would like your help in thinking about the error metrics, because I changed `abs_diff` to include diffs of null values filled in with 0s on both the calculated and reported value sides (in this comment, really only on 1241). That way the correction records are much, much more all-encompassing, but this resulted in the metric checks failing: `abs_diff` is the main input to all of the metric checks, and with this change there are way more calculated records with `abs_diff` values. The `relative_error_magnitude` check is the main one that is now failing much more frequently, which makes sense because it is now including a lot of null-vs-some-value comparisons.
We could force the metric checks to behave like they used to (I think) by only checking these values when the reported and/or calculated values are non-null. But I think it makes more sense to adjust the tolerance values, because this new change actually feels more representative of the data itself.
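For concreteness, here is a minimal sketch of why zero-filling both sides changes the metrics. This is not the actual PUDL implementation; the column names just mirror the discussion above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "reported_value":   [100.0, np.nan, 250.0, np.nan],
    "calculated_value": [110.0, 40.0,   np.nan, np.nan],
})

# New behavior: fill nulls with 0 on both sides before differencing, so a
# null-vs-value pair produces a (large) abs_diff instead of a null.
reported = df["reported_value"].fillna(0)
calculated = df["calculated_value"].fillna(0)
df["abs_diff"] = (calculated - reported).abs()

# The relative error blows up (inf) wherever the reported side was null and got
# filled with 0, which is roughly why the relative_error_magnitude checks now
# fail more often.
df["rel_diff"] = df["abs_diff"] / reported.abs()
print(df)
```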
Is this issue still outstanding?
I'm working on an update to our `net_plant_balance` table, which compiles values of original cost, accumulated depreciation, and net plant balance by FERC technology, then adds an estimate of renewables net plant balance as further detail of the "other" category. I'll then use that renewables breakout as further detail of your rate base table - but I'm not expecting that you'll integrate our estimate of renewables, because it's modeled rather than reported data.
I'm using `core_ferc1__yearly_depreciation_by_function_sched219` for the accumulated depreciation data, but I'm finding the same issue you have here: most 2021 data is missing.
I checked the raw XBRL table `accumulated_provision_for_depreciation_of_electric_utility_plant_functional_classification_section_b_219_instant` and I see data values there for accumulated depreciation by function that don't propagate into the PUDL table `core_ferc1__yearly_depreciation_by_function_sched219`. So I don't know what the issue is, but I do think the raw data exists rather than being missing.
Hey @jrea-rmi yes I think this is still outstanding, and work on it is happening in PR #3450
@cmgosnell that work seems to be focused on corrections. But I think there's an issue in converting the raw XBRL table into the PUDL table `core_ferc1__yearly_depreciation_by_function_sched219`, a separate issue from calculation corrections.
Hey @jrea-rmi, sorry for the delay on getting back to this! This is a good find without a super straightforward solution. I'm going to lay out what I understand to be the problem and a possible solution, but I will want @jdangerx to verify my assessment here because you're much more familiar with the raw XBRL data than I am.
The high level problem here is that FERC does not replace records when a utility comes back and files an update, so we filter for the most recently updated record (`pudl.io_managers.FercXBRLSQLiteIOManager.filter_for_freshest_data`). In this table, basically all of the records got an update that only contains the total records, whereas the previous update does include more detail.
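As a rough sketch of the deduplication behavior being described - assuming illustrative column names like `publication_time` and a list of context key columns, not the real schema - the current approach keeps the entire latest row, nulls and all:

```python
import pandas as pd

def keep_freshest(raw: pd.DataFrame, context_cols: list[str]) -> pd.DataFrame:
    """Sketch: for each XBRL context, keep the whole row from the most recent filing.

    This is only an approximation of what filter_for_freshest_data does.
    Because the latest row is taken at face value, a re-filing that only
    contains totals wipes out the more detailed values from earlier filings.
    """
    return (
        raw.sort_values("publication_time")
        .drop_duplicates(subset=context_cols, keep="last")
    )
```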
If this loss of detail is a blocker, I could imagine making an exception for this table/year only, where we grab the slightly less fresh data. But only if a few things are true:
- can we easily identify the most recent and penultimate records?
- for every utility, when the totals are reported both in the most recent and penultimate record for 2021, are the totals at least the same?
- would this be maintainable? when we make subsequent ferc1 xbrl sqlite dbs, will it be obvious to choose the freshest vs penultimate record?
I'm going to try to answer the first two questions, but I'd love @jdangerx's perspective on this!!
Great. This loss of detail is definitely important for us!
If there isn't an update to any of the detailed by function values, I sure hope the totals are the same. And if they are, then it seems reasonable to use the penultimate record. Would that be easier than combining updated/previous records/filings?
So for each XBRL context (utility name, report year, ... - the primary key, basically) we could have multiple sets of facts from multiple XBRL filings.
Currently we treat the most recent filing as completely authoritative - if the most recent filing reports everything as nulls, we just trust that that's more accurate.
It would be very easy to say "sort all the facts for this context by publication time, and take the latest non-null reported value for each."
The new way is preferable in cases like this, where the newer filings only contain the updated values.
But the old way is preferable if a value was erroneously reported as non-null, i.e. someone put a number in the wrong field.
I sort of think that cases like this are more common, so would be happy to switch the behavior over - but this might uncover some other weird behavior. I don't think there's a good blanket decision to make here.
To capture this data nuance, we could try to dynamically apply the two different approaches:
- group the XBRL records by "context keys"
- write custom deduplication logic:
- if there's a steep dropoff in non-null values from one publication date to the next, assume this is a "diff only" filing. Apply the "last known non-null value" approach
- otherwise, assume this is a "whole dang report with updated values" filing. Apply "take most recent filing at face value" approach
- we could use specific context values, e.g. report year, utility name, etc. to determine which approach to use also
This gives us some broadly correct behavior while then allowing us to apply spot fixes if necessary... that refactor wouldn't be too big of a time sink, I think - the code is pretty easy to test and there's a natural place to pull out the 'deduplicate a group' logic. If we want to go for that I bet we could get it done in 5-10h - @cmgosnell I'm happy to crank on it if you have too much stuff on your plate.
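Here's a rough sketch of what that heuristic could look like, under invented column names (`publication_time`, a list of context key columns, and the fact/value columns) - a starting point rather than the real logic:

```python
import pandas as pd

def dedupe_context(group: pd.DataFrame, value_cols: list[str], dropoff: float = 0.5) -> pd.Series:
    """Deduplicate all filings for a single XBRL context (hypothetical heuristic)."""
    group = group.sort_values("publication_time")
    non_null = group[value_cols].notna().sum(axis=1)
    if len(group) > 1 and non_null.iloc[-1] < dropoff * non_null.iloc[:-1].max():
        # Steep dropoff in reported values: treat the latest filing as a diff
        # and take the last known non-null value for each fact.
        return group[value_cols].ffill().iloc[-1]
    # Otherwise treat the latest filing as a full snapshot and take it at face value.
    return group[value_cols].iloc[-1]

# Hypothetical usage:
# deduped = raw.groupby(context_cols).apply(dedupe_context, value_cols=value_cols)
```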
@jdangerx if you feel like you have capacity to take this on that would be swell! I would be a little reluctant to change the default behavior, but it might be a good idea. It sounds like it cooould end up being simpler.
And maybe I'm off on this, but could these kinds of nulls perhaps be a result of the SQL-ification of the XBRL data? As in: can the utilities re-report one specific fact, which we then slurp into a db table so it looks like a record with a bunch of nulls? I'm not sure if there is a way to distinguish a null because it's really null vs. a null because the respondent only updated a few values.
I think you have more context on how simple and/or appropriate applying either approach would be so I'm happy to defer to you on this one.
Did some more digging & thinking, screed ahead!
Here are the possible ways to deduplicate:
- take the first one ("first snapshot")
- take the last one ("last snapshot")
- take the most complete whole filing ("best snapshot"), using publication time to break ties
- take the latest non-null value ("apply diffs")
- take the first non-null value ("I'm having trouble figuring out why someone would do this and am only including for completeness")
There are a bunch of tables with this data loss issue - if I switch from "last snapshot" to "first snapshot" dataset-wide, I see 3 tables where we have at least 1 completely null column in 2021 due to the "last snapshot" methodology. This is a lower bound on the data loss since I didn't look for more subtle forms.
- `raw_ferc1_xbrl__summary_of_utility_plant_and_accumulated_provisions_for_depreciation_amortization_and_depletion_200_instant`
- `raw_ferc1_xbrl__electric_energy_account_401a_duration`
- `raw_ferc1_xbrl__accumulated_provision_for_depreciation_of_electric_utility_plant_functional_classification_section_b_219_instant`
The "first snapshot" approach means we miss out on all updates whether they are snapshot updates or diff updates. It also means we miss out on all the problems we have dealing with updated data and multiple filings, but I think the value of getting updated data outweighs that benefit.
This data loss we've found is due to applying "last snapshot" everywhere, even when the latest filing was a diff. So we shouldn't blanket apply that either.
We could apply "apply diffs" everywhere, which gets us the latest / most complete data possible. But...
- an individual core table row may now correspond to data from multiple XBRL filings. This makes investigation a little harder since we'll have to look at all the source data, and is why we decided to go with "last snapshot" in the first place.
- we also can't tell if a diff explicitly sets a value to `null` vs. just not reporting any changes. We could add some logic in `ferc-xbrl-extractor` to do that, which might be nice, but feels out of scope for this issue.
We could apply "best snapshot" everywhere, which would keep the 1:1 mapping from table row to XBRL filing. But...
- I think the 1:1 mapping is not important to keep, since it's not conceptually how the XBRL filings work - each XBRL context represents some sort of primary key, which can have many facts mapped to it over many filings. Unfortunately, I think this means that we have to look at all the raw data associated with a specific PK to actually understand the data in the core table.
  - side note: we drop publication date and filing name in the core tables anyways - so this 1:1 relationship is already being erased in our existing system
- we'd miss out on actual diff updates
Weighing all these, my conclusions are:
- it's not that important to keep a 1:1 mapping from core table to filing
- we'll need to do more spot fix logic for "apply best snapshot" than we will for "apply diff"
- the data-loss failure case of "apply diff" is "we do not erase a value that should have been null from the latest snapshot"
- the data-loss failure case of "apply best snapshot" is "we still don't erase that value, and we don't get the updated data"
My pitch is this:
- we change the default methodology to "apply diff" - I think picking up updates is more important than keeping a neat mapping from core table row to XBRL filing
- we allow spot-fix logic to switch to "best snapshot" in specific XBRL contexts where we identify a problematic diff somehow. This would technically work where the deduplication logic is now, but would fit much better if we refactored to move the deduplication logic into the FERC XBRL transformation framework.
- we fix ferc-xbrl-extractor to differentiate between non-reported facts and reported-null facts. IMO this means changing the extractor to extract a stacked version of the current raw XBRL tables. Then "non-reported" facts would simply be not present in the data, while reported-null facts would be present in the data, and we would be able to handle that difference within PUDL.
To fix this data loss issue in the rate base table, the minimal change is the first bullet. The next two bullets are useful infrastructural investments which will make future XBRL-based work easier.
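As a purely illustrative example of the stacked idea in the third bullet (none of these names reflect the actual ferc-xbrl-extractor schema): a wide row can't distinguish "never reported" from "explicitly reported as null", while a stacked layout can.

```python
import numpy as np
import pandas as pd

# Wide layout: a NaN is ambiguous - never reported, or explicitly nulled out?
wide = pd.DataFrame({
    "entity_id": ["C001"],
    "date": ["2021-12-31"],
    "accumulated_depreciation_total": [1_000_000.0],
    "accumulated_depreciation_distribution": [np.nan],
})

# Stacked layout: one fact per row. An explicitly reported null is present as a
# row with a null value; a fact the filer never touched simply has no row at all.
stacked = pd.DataFrame({
    "entity_id": ["C001", "C001"],
    "date": ["2021-12-31", "2021-12-31"],
    "fact_name": [
        "accumulated_depreciation_total",
        "accumulated_depreciation_distribution",
    ],
    "value": [1_000_000.0, np.nan],
})
```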
My current plan, then, is to make a small PR for the first bullet and file issues/put TODOs in the code for the other two. If we think that further investments are worthwhile right now I'm happy to take those on too - but that depends on our priorities, project budget, etc. @cmgosnell @jonrea do we have specific expectations for how much more time we want to spend on this project?
wow, great thinking here.
I agree that the best way to do this is the "apply diff" method, and that it will be important to be able to differentiate between non-reported facts and reported-null facts.
In terms of RMI priorities, this is at least near the top. For the next month or two, we want to keep our overall monthly budget within or below what it's been previously. Let's review that in our next check-in on April 5, but definitely worth putting some effort in between now and then.
After talking a bit more with @cmgosnell - let's do a little bit more investigation to better understand the impact of changing our deduplication approach:
- Let's take a look at the raw XBRL data and see if we can find any actively reported null values there - that can help us understand if we'll be missing any instances where a value is changed from non-null to null.
- Let's also take a look at the number of non-null values per XBRL context per filing - it may be easy to guess whether a filing is a diff or a snapshot, and then act accordingly.
Then we can think about what the best option for handling this data is.
To find the values that are actually reported as null values, I looked for `xsi:nil="true"` in all the filings that might have 2021 data (i.e. 2021 and 2022 filings).
What I saw was that, for the most part, these numbers tended not to change much between filings. I'm not sure if any of these are values that have changed from non-null to null. But it does seem like we should look at whether there is a nice way to tell the difference between diff filings and snapshot filings, since some of these numbers seem non-trivial.
Here are the filings I saw where the number of null values increased between filings over time. The timestamps starting with something like 165... are 2021 filings and 168... are 2022 filings.
```
Alaska_Electric_Light_and_Power_Company_form1_Q4_1650331752.xbrl:4
Alaska_Electric_Light_and_Power_Company_form1_Q4_1681860538.xbrl:193
Avista_Corporation_form1_Q4_1650073347.xbrl:22
Avista_Corporation_form1_Q4_1681871695.xbrl:31
ISO_New_England_Inc._form1_Q4_1650043587.xbrl:9
ISO_New_England_Inc._form1_Q4_1681504193.xbrl:25
Lockhart_Power_Company_form1_Q4_1647971588.xbrl:11
Lockhart_Power_Company_form1_Q4_1679619233.xbrl:13
Mckenzie_Electric_Cooperative,_Inc._form1_Q4_1650337193.xbrl:22
Mckenzie_Electric_Cooperative,_Inc._form1_Q4_1681892382.xbrl:23
Midcontinent_Independent_System_Operator,_Inc._form1_Q4_1650329591.xbrl:11
Midcontinent_Independent_System_Operator,_Inc._form1_Q4_1681766565.xbrl:8
Midcontinent_Independent_System_Operator,_Inc._form1_Q4_1681844537.xbrl:10
New_York_Transco,_LLC_form1_Q4_1650679051.xbrl:2
New_York_Transco,_LLC_form1_Q4_1681874702.xbrl:4
PJM_Interconnection,_L.L.C._form1_Q4_1649716496.xbrl:12
PJM_Interconnection,_L.L.C._form1_Q4_1681512803.xbrl:13
Southern_California_Edison_Company_form1_Q4_1649310757.xbrl:184
Southern_California_Edison_Company_form1_Q4_1659404061.xbrl:185
Southern_California_Edison_Company_form1_Q4_1680848437.xbrl:153
Startrans_IO,_LLC_form1_Q4_1650319617.xbrl:14
Startrans_IO,_LLC_form1_Q4_1681877357.xbrl:17
Startrans_IO,_LLC_form1_Q4_1689807014.xbrl:17
Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1650333900.xbrl:10
Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1651188465.xbrl:10
Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1652991210.xbrl:13
Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1668555358.xbrl:13
Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1681851597.xbrl:11
```
```
> ag -c 'xsi:nil="true"' | sort
ALABAMA_POWER_COMPANY_form1_Q4_1650346677.xbrl:8
ALABAMA_POWER_COMPANY_form1_Q4_1681786461.xbrl:8
ALLETE,_Inc._form1_Q4_1650333678.xbrl:48
ALLETE,_Inc._form1_Q4_1681865705.xbrl:44
Alaska_Electric_Light_and_Power_Company_form1_Q4_1650331752.xbrl:4
Alaska_Electric_Light_and_Power_Company_form1_Q4_1681860538.xbrl:193
Ameren_Illinois_Company_form1_Q4_1649472192.xbrl:2
Ameren_Illinois_Company_form1_Q4_1662778537.xbrl:2
Ameren_Illinois_Company_form1_Q4_1681174555.xbrl:2
Ameren_Transmission_Company_of_Illinois_form1_Q4_1649471880.xbrl:2
Ameren_Transmission_Company_of_Illinois_form1_Q4_1662778046.xbrl:2
Ameren_Transmission_Company_of_Illinois_form1_Q4_1681269271.xbrl:2
American_Transmission_Company_LLC_form1_Q4_1650677382.xbrl:1162
American_Transmission_Company_LLC_form1_Q4_1681850636.xbrl:30
Avista_Corporation_form1_Q4_1650073347.xbrl:22
Avista_Corporation_form1_Q4_1681871695.xbrl:31
Basin_Electric_Power_Cooperative_form1_Q4_1649920868.xbrl:60
Basin_Electric_Power_Cooperative_form1_Q4_1681856053.xbrl:30
CENTRAL_HUDSON_GAS_&_ELECTRIC_CORPORATION_form1_Q4_1650324047.xbrl:1
CENTRAL_HUDSON_GAS_&_ELECTRIC_CORPORATION_form1_Q4_1681434108.xbrl:1
California_Independent_System_Operator_Corporation_form1_Q4_1650089940.xbrl:2
California_Independent_System_Operator_Corporation_form1_Q4_1681787688.xbrl:2
CenterPoint_Energy_Houston_Electric,_LLC_form1_Q4_1681507318.xbrl:2
Central_Hudson_Gas_&_Electric_form1_Q4_1650140284.xbrl:1
Citizens_Sunrise_Transmission_LLC_form1_Q4_1650054717.xbrl:2
Citizens_Sunrise_Transmission_LLC_form1_Q4_1681522045.xbrl:1
Citizens_Sycamore-Penasquitos_Transmission_LLC_form1_Q4_1650324086.xbrl:5
Citizens_Sycamore-Penasquitos_Transmission_LLC_form1_Q4_1681522122.xbrl:3
Connecticut_Yankee_Atomic_Power_Company_form1_Q4_1652466890.xbrl:1
Connecticut_Yankee_Atomic_Power_Company_form1_Q4_1683767371.xbrl:1
DATC_Path_15,_LLC_form1_Q4_1650336567.xbrl:2
DATC_Path_15,_LLC_form1_Q4_1651612870.xbrl:2
DATC_Path_15,_LLC_form1_Q4_1681509693.xbrl:2
Duke_Energy_Carolinas,_LLC_form1_Q4_1650332292.xbrl:2
Duke_Energy_Carolinas,_LLC_form1_Q4_1681509546.xbrl:2
Duke_Energy_Carolinas,_LLC_form1_Q4_1683749074.xbrl:2
Duke_Energy_Carolinas,_LLC_form1_Q4_1691775432.xbrl:2
Duke_Energy_Indiana,_LLC_form1_Q4_1650326662.xbrl:2
Duke_Energy_Indiana,_LLC_form1_Q4_1681783754.xbrl:2
Duke_Energy_Indiana,_LLC_form1_Q4_1683137115.xbrl:2
Duke_Energy_Kentucky,_Inc._form1_Q4_1650328445.xbrl:2
Duke_Energy_Kentucky,_Inc._form1_Q4_1653106736.xbrl:2
Duke_Energy_Kentucky,_Inc._form1_Q4_1681518600.xbrl:2
Duke_Energy_Kentucky,_Inc._form1_Q4_1682374583.xbrl:2
Fitchburg_Gas_and_Electric_Light_Company_form1_Q4_1648787356.xbrl:8
Fitchburg_Gas_and_Electric_Light_Company_form1_Q4_1679449968.xbrl:8
Fitchburg_Gas_and_Electric_Light_Company_form1_Q4_1679516751.xbrl:8
Georgia_Power_Company_form1_Q4_1649884766.xbrl:5
Georgia_Power_Company_form1_Q4_1680321812.xbrl:4
Georgia_Power_Company_form1_Q4_1682045206.xbrl:4
Georgia_Power_Company_form1_Q4_1688178807.xbrl:4
ISO_New_England_Inc._form1_Q4_1650043587.xbrl:9
ISO_New_England_Inc._form1_Q4_1681504193.xbrl:25
Idaho_Power_Company_form1_Q4_1650048652.xbrl:29
Idaho_Power_Company_form1_Q4_1663011551.xbrl:29
Idaho_Power_Company_form1_Q4_1681498784.xbrl:27
Lockhart_Power_Company_form1_Q4_1647971588.xbrl:11
Lockhart_Power_Company_form1_Q4_1679619233.xbrl:13
Madison_Gas_and_Electric_Company_form1_Q4_1650318769.xbrl:18
Madison_Gas_and_Electric_Company_form1_Q4_1681774992.xbrl:14
Maine_Yankee_Atomic_Power_Company_form1_Q4_1652481695.xbrl:1
Maine_Yankee_Atomic_Power_Company_form1_Q4_1652481987.xbrl:1
Mckenzie_Electric_Cooperative,_Inc._form1_Q4_1650337193.xbrl:22
Mckenzie_Electric_Cooperative,_Inc._form1_Q4_1681892382.xbrl:23
Midcontinent_Independent_System_Operator,_Inc._form1_Q4_1650329591.xbrl:11
Midcontinent_Independent_System_Operator,_Inc._form1_Q4_1681766565.xbrl:8
Midcontinent_Independent_System_Operator,_Inc._form1_Q4_1681844537.xbrl:10
Morongo_Transmission_LLC_form1_Q4_1652753079.xbrl:7
Morongo_Transmission_LLC_form1_Q4_1681866628.xbrl:3
Mountrail-Williams_Electric_Cooperative_form1_Q4_1681856070.xbrl:6
New_York_Transco,_LLC_form1_Q4_1650679051.xbrl:2
New_York_Transco,_LLC_form1_Q4_1681874702.xbrl:4
PJM_Interconnection,_L.L.C._form1_Q4_1649716496.xbrl:12
PJM_Interconnection,_L.L.C._form1_Q4_1681512803.xbrl:13
Portland_General_Electric_Company_form1_Q4_1649826303.xbrl:70
Portland_General_Electric_Company_form1_Q4_1650930041.xbrl:70
Portland_General_Electric_Company_form1_Q4_1681523942.xbrl:64
SOUTHERN_ELECTRIC_GENERATING_COMPANY_form1_Q4_1650338438.xbrl:2
SOUTHERN_ELECTRIC_GENERATING_COMPANY_form1_Q4_1650504486.xbrl:2
SOUTHERN_ELECTRIC_GENERATING_COMPANY_form1_Q4_1681815561.xbrl:1
San_Diego_Gas_&_Electric_Company_form1_Q4_1681804734.xbrl:2
Southern_California_Edison_Company_form1_Q4_1649310757.xbrl:184
Southern_California_Edison_Company_form1_Q4_1659404061.xbrl:185
Southern_California_Edison_Company_form1_Q4_1680848437.xbrl:153
Startrans_IO,_LLC_form1_Q4_1650319617.xbrl:14
Startrans_IO,_LLC_form1_Q4_1681877357.xbrl:17
Startrans_IO,_LLC_form1_Q4_1689807014.xbrl:17
Superior_Water_Light_&_Power_form1_Q4_1650333855.xbrl:10
Superior_Water_Light_&_Power_form1_Q4_1681865034.xbrl:9
UNION_ELECTRIC_COMPANY_form1_Q4_1649986899.xbrl:2
UNION_ELECTRIC_COMPANY_form1_Q4_1681357173.xbrl:2
Unitil_Energy_Systems,_Inc._form1_Q4_1648788001.xbrl:4
Unitil_Energy_Systems,_Inc._form1_Q4_1679450160.xbrl:4
Unitil_Energy_Systems,_Inc._form1_Q4_1679516714.xbrl:4
Upper_Michigan_Energy_Resources_Corporation_form1_Q4_1650071061.xbrl:36
Upper_Michigan_Energy_Resources_Corporation_form1_Q4_1681790826.xbrl:12
Upper_Missouri_G._&_T._Electric_Cooperative,_Inc._form1_Q4_1649721069.xbrl:16
Upper_Missouri_G._&_T._Electric_Cooperative,_Inc._form1_Q4_1680128870.xbrl:6
Versant_Power_form1_Q4_1650001412.xbrl:8
Versant_Power_form1_Q4_1680661938.xbrl:8
Versant_Power_form1_Q4_1685056664.xbrl:8
Versant_Power_form1_Q4_1689725909.xbrl:8
Wisconsin_Electric_Power_Company_form1_Q4_1650071326.xbrl:56
Wisconsin_Electric_Power_Company_form1_Q4_1681785089.xbrl:8
Wisconsin_Public_Service_Corporation_form1_Q4_1650072443.xbrl:14
Wisconsin_Public_Service_Corporation_form1_Q4_1681786108.xbrl:37
Wisconsin_River_Power_Company_form1_Q4_1650069359.xbrl:3
Wisconsin_River_Power_Company_form1_Q4_1681509642.xbrl:2
Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1650333900.xbrl:10
Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1651188465.xbrl:10
Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1652991210.xbrl:13
Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1668555358.xbrl:13
Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1681851597.xbrl:11
Yankee_Atomic_Electric_Company_form1_Q4_1652480674.xbrl:8
Yankee_Atomic_Electric_Company_form1_Q4_1683824491.xbrl:1
```
We see a few fields dominating those null values:
```
> ag --nofilename 'xsi:nil="true"' | ag '<ferc:' | awk '{print $1}' | sort | uniq -c | sort -rn | head
743 <ferc:LengthForTransmissionLinesAggregatedWithOtherStructures
408 <ferc:LengthForStandAloneTransmissionLines
307 <ferc:BillingDemand
73 <ferc:GeneralInformationAboutPropertyOfTheRespondentHeldByReceiverOrTrusteeDescription
49 <ferc:DepreciablePlantBase
46 <ferc:UtilityPlantEstimatedAverageServiceLife
46 <ferc:OtherRegulatoryLiabilities
44 <ferc:UtilityPlantAppliedDepreciationRate
42 <ferc:ElectricPlantPropertyClassifiedAsHeldForFutureUseExpectedUseInServiceDate
39 <ferc:FootnoteReferences
```
Unfortunately, it seems like we don't have a super clean distribution to differentiate between "definitely a diff" and "definitely a snapshot" - here's the distribution of "number of non-null data values per (context, filing) pair" in the three problem tables. Here "data values" means "values in columns that are not the primary key."
And here is a similar distribution, but defining "data values" as "values in columns which have any NAs at all":
The order here is:

```python
[
    "raw_ferc1_xbrl__accumulated_provision_for_depreciation_of_electric_utility_plant_functional_classification_section_b_219_instant",
    "raw_ferc1_xbrl__summary_of_utility_plant_and_accumulated_provisions_for_depreciation_amortization_and_depletion_200_instant",
    "raw_ferc1_xbrl__electric_energy_account_401a_duration",
]
```
So you can sort of see that the first and third tables might have a nice cutoff for determining diff-ness vs. snapshot-ness but the second one seems pretty arbitrary.
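For reference, a sketch of the counting behind those distributions, again with invented column names (`filing_name`, plus lists of context key columns and value columns):

```python
import pandas as pd

def non_null_counts(raw: pd.DataFrame, context_cols: list[str], value_cols: list[str]) -> pd.Series:
    """Count non-null data values reported per (XBRL context, filing)."""
    return (
        raw.assign(n_non_null=raw[value_cols].notna().sum(axis=1))
        .groupby(context_cols + ["filing_name"])["n_non_null"]
        .sum()
    )

# non_null_counts(raw, context_cols, value_cols).plot.hist() gives distributions
# like the ones described above.
```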
Upshot: we should dig in and see what values become non-null when using "apply diffs" vs. "best snapshot" vs. "last snapshot" - and then do a spot check of those with the raw data to see what happened there. If the spot check doesn't turn up new/surprising issues, and the number of "values that go from null to non-null because we missed an actively reported null value" isn't super high, then we should go ahead with the "apply diffs everywhere" approach.
Depending on how many of those potentially missed null values there are, we should also prioritize the ferc-xbrl-extractor refactor. That will probably take a good chunk of time - something on the order of 50 hours, since it radically changes the structure of our XBRL data, which the rest of our XBRL transforms rely on.
It's kind of wild that there's so much complexity coming from these refilings / partial updates. @campbellpryde @austinmatherne-wk is this a normal thing for XBRL? Is there a canonical way that it's dealt with? Or is FERC somehow making this extra difficult?
To be fair, that’s sort of our fault for not distinguishing between non-reported values and reported null values in our SQLite conversion. Though the FERC filings might be weird in one or another way beyond that…
TL;DR: apply-diffs is the best short term solution. The medium-term step after that is moving the XBRL deduplication logic into our transformation framework so we can apply spot fixes. In the future we should be building better tools to help us quickly identify and investigate suspicious data because there will always be some level of manual spot fixing needed.
I compared best-snapshot with apply-diffs for 20 tables and found 55 values which went from `null` in the best-snapshot case to a non-null value. I also found 295 values which were updated from one non-null value to another non-null value, which I assumed were legitimate updates. So, about 90% of the diff between the two favors apply-diffs.
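Roughly, the comparison was of this shape - a sketch assuming two frames that have already been deduplicated with each strategy and share the same index and columns (not the actual code I ran):

```python
import pandas as pd

def categorize_changes(best_snapshot: pd.DataFrame, apply_diffs: pd.DataFrame) -> dict[str, int]:
    """Count how values differ between the two deduplication strategies."""
    both_null = best_snapshot.isna() & apply_diffs.isna()
    changed = best_snapshot.ne(apply_diffs) & ~both_null
    return {
        # values that were null under best-snapshot but non-null under apply-diffs
        "null_to_non_null": int((best_snapshot.isna() & apply_diffs.notna()).sum().sum()),
        # values that changed from one non-null value to another
        "non_null_to_non_null": int((changed & best_snapshot.notna() & apply_diffs.notna()).sum().sum()),
    }
```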
For the first category, I largely relied on tracing the individual numeric values through the raw XBRL files.
Most of the changes were legitimate updates. There were a few instances where a non-null value for field `A` was reported for a similar-sounding field `A'` in later reports. Most of the time, field `A` was never updated with a null value. The only instance I found where a utility reports a null value to overwrite field `A` was MISO. Good job MISO!
I had one question about some of the values, specifically in the `raw_ferc1_xbrl__accumulated_provision_for_depreciation_of_electric_utility_plant_functional_classification_section_b_219_instant` table, that maybe @jrea-rmi could help answer.
We see a de-nullification in the `accumulated_depreciation_distribution` value for Silver Run Electric for 2021, which is 5,257,734.
That value shows up as the following facts in Silver Run filings:
- `Silver_Run_Electric,_Llc_form1_Q4_1650068815.xbrl` (2021 filing 1): AccumulatedDepreciationDistribution, AccumulatedProvisionForDepreciationOfElectricUtilityPlant, DepreciationUtilityPlantInService
- `Silver_Run_Electric,_Llc_form1_Q4_1654217180.xbrl` (2021 filing 2): AccumulatedDepreciationTransmission, AccumulatedProvisionForDepreciationOfElectricUtilityPlant, DepreciationUtilityPlantInService
- `Silver_Run_Electric,_Llc_form1_Q4_1681780102.xbrl` (2022 filing 1): AccumulatedProvisionForDepreciationOfElectricUtilityPlant
- `Silver_Run_Electric,_Llc_form1_Q4_1687981013.xbrl` (2022 filing 2): AccumulatedProvisionForDepreciationOfElectricUtilityPlant
So this looks like they initially reported that value as "distribution" but later updated it to "transmission." I suppose either report could be in error there, but maybe @jonrea you have better insight into which one is correct.
I spot checked the following other tables:
```python
['raw_ferc1_xbrl__statement_of_income_114_duration',
'raw_ferc1_xbrl__other_regulatory_liabilities_account_254_278_instant',
'raw_ferc1_xbrl__steam_electric_generating_plant_statistics_large_plants_402_instant',
'raw_ferc1_xbrl__electric_operating_revenues_300_duration',
'raw_ferc1_xbrl__electric_energy_account_401a_duration',
'raw_ferc1_xbrl__summary_of_depreciation_and_amortization_charges_section_a_336_duration',
'raw_ferc1_xbrl__electric_plant_in_service_204_instant',
'raw_ferc1_xbrl__pumped_storage_generating_plant_statistics_large_plants_408_instant',
'raw_ferc1_xbrl__retained_earnings_118_instant',
'raw_ferc1_xbrl__statement_of_cash_flows_120_instant',
'raw_ferc1_xbrl__comparative_balance_sheet_liabilities_and_other_credits_110_instant',
'raw_ferc1_xbrl__accumulated_provision_for_depreciation_of_electric_utility_plant_functional_classification_section_b_219_instant',
'raw_ferc1_xbrl__electric_operations_and_maintenance_expenses_320_duration',
'raw_ferc1_xbrl__accumulated_provision_for_depreciation_of_electric_utility_plant_changes_section_a_219_duration',
'raw_ferc1_xbrl__electric_plant_in_service_204_duration',
'raw_ferc1_xbrl__generating_plant_statistics_410_instant',
'raw_ferc1_xbrl__comparative_balance_sheet_assets_and_other_debits_110_instant',
'raw_ferc1_xbrl__hydroelectric_generating_plant_statistics_large_plants_406_instant',
'raw_ferc1_xbrl__summary_of_utility_plant_and_accumulated_provisions_for_depreciation_amortization_and_depletion_200_instant',
 'raw_ferc1_xbrl__accumulated_provision_for_depreciation_of_electric_utility_plant_changes_section_a_219_instant']
```
I think, before we tackle the FERC 2023 data in earnest, we should make some tools that identify suspicious values and help us quickly investigate them / turn them into spot fixes. But we don't need to do that now.
Silver Run Electric is a transmission company, so the 2021 filing 2 with value assigned to transmission is correct.
I agree with using apply-diffs short term, would want to go ahead with that as quickly as we can!