pudl icon indicating copy to clipboard operation
pudl copied to clipboard

Bug 509

Open knordback opened this issue 2 years ago • 19 comments

This is a proposed way of dealing with fields that are in the input data but we're no longer dropping in the eia923 transform() functions, but aren't in the corresponding SQLite tables. Basically, I'm just stripping out the columns after the harvesting step.

In ToT code, the fields are dropped so they never appear in the DataFrames. If we don't drop them, then leaving the corresponding columns in results in a crash when trying to populate the SQLite DB.

In this code, the relevant field I'm dealing with is utility_name_eia, which comes in as operator_name. This is in the generation_fuel_eia923 table, but gets duplicated in generation_fuel_nuclear_eia923. So it needs to be removed from both.

This approach feels very special-case-y. But maybe that's the nature of the beast. If the general approach seems okay and other currently-dropped fields need to be handled in the same way, then I'd probably at least pull the new code out into a separate function.

knordback avatar Feb 27 '23 04:02 knordback

Codecov Report

Patch and project coverage have no change.

Comparison is base (62eee76) 88.4% compared to head (eaa7615) 88.4%.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2333   +/-   ##
=====================================
  Coverage   88.4%   88.4%           
=====================================
  Files         87      87           
  Lines      10139   10139           
=====================================
  Hits        8971    8971           
  Misses      1168    1168           
Impacted Files Coverage Δ
src/pudl/transform/eia923.py 85.2% <ø> (ø)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Feb 27 '23 05:02 codecov[bot]

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@knordback I merged in dev which had some changes in the boilers_eia860 transform that overlapped w/ a change you had made (but that block of code was deleted so it was easy to resolve).

The merge also fixed some linting issues that were preventing the CI from passing.

Do you have the pre-commit hooks installed and running locally?

zaneselvans avatar Mar 21 '23 04:03 zaneselvans

If you've got the full DB already populated locally, you can run the full data tests and validations in parallel in two different windows with these commands from the main repo directory:

pytest --live-dbs test/validate
pytest --live-dbs --etl-settings src/pudl/package_data/settings/etl_full.yml test/integration

zaneselvans avatar May 25 '23 01:05 zaneselvans

With the current code I get this:

===================================================================================================== short test summary info ======================================================================================================
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-bf_eia923-1427692-1427692-119611] - ValueError: bf_eia923: found 1428112 rows, expected 1427692. Off by 0.029%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-frc_eia923-597000-244415-24065] - ValueError: frc_eia923: found 608494 rows, expected 597000. Off by 1.925%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-gf_eia923-2687345-2687345-230149] - ValueError: gf_eia923: found 2690175 rows, expected 2687345. Off by 0.105%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-pu_eia860-184745-184745-184745] - ValueError: pu_eia860: found 184831 rows, expected 184745. Off by 0.047%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-utils_eia860-119388-119388-119388] - ValueError: utils_eia860: found 121464 rows, expected 119388. Off by 1.739%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-bf_eia923-1427692-1427692-119611] - ValueError: bf_eia923: found 119646 rows, expected 119611. Off by 0.029%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-frc_eia923-597000-244415-24065] - ValueError: frc_eia923: found 24285 rows, expected 24065. Off by 0.914%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 428497 rows, expected 432570. Off by -0.942%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_eia923-2687345-2687345-230149] - ValueError: gf_eia923: found 230344 rows, expected 230149. Off by 0.085%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-pu_eia860-184745-184745-184745] - ValueError: pu_eia860: found 184831 rows, expected 184745. Off by 0.047%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-utils_eia860-119388-119388-119388] - ValueError: utils_eia860: found 121464 rows, expected 119388. Off by 1.739%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 30342 rows, expected 30340. Off by 0.007%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 46410 rows, expected 46408. Off by 0.004%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-fuel_cost-555119-46408] - ValueError: fuel_cost: found 46410 rows, expected 46408. Off by 0.004%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 428497 rows, expected 432570. Off by -0.942%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-mcoe-5171881-432602] - ValueError: mcoe: found 428529 rows, expected 432602. Off by -0.942%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-bf_eia923-1427692-1427692-119611] - ValueError: bf_eia923: found 1428112 rows, expected 1427692. Off by 0.029%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-frc_eia923-597000-244415-24065] - ValueError: frc_eia923: found 246368 rows, expected 244415. Off by 0.799%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 5122777 rows, expected 5171497. Off by -0.942%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_eia923-2687345-2687345-230149] - ValueError: gf_eia923: found 2690175 rows, expected 2687345. Off by 0.105%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-pu_eia860-184745-184745-184745] - ValueError: pu_eia860: found 184831 rows, expected 184745. Off by 0.047%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-utils_eia860-119388-119388-119388] - ValueError: utils_eia860: found 121464 rows, expected 119388. Off by 1.739%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 362405 rows, expected 362381. Off by 0.007%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 555143 rows, expected 555119. Off by 0.004%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-fuel_cost-555119-46408] - ValueError: fuel_cost: found 555143 rows, expected 555119. Off by 0.004%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 5122777 rows, expected 5171497. Off by -0.942%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-mcoe-5171881-432602] - ValueError: mcoe: found 5123161 rows, expected 5171881. Off by -0.942%, allowed margin of 0.000%
================================================================================ 27 failed, 35 passed, 6 skipped, 47 warnings in 2606.97s (0:43:26) ================================================================================
minmax_rows: exit 1 (2625.05 seconds) /home/knordback/Kurt/pudl> pytest --color=yes --live-dbs test/validate/epacamd_eia_test.py::test_minmax_rows test/validate/ferc1_test.py::test_minmax_rows test/validate/eia_test.py::test_minmax_rows test/validate/mcoe_test.py::test_minmax_rows_mcoe pid=3047649

Does this make sense ? Is there an easy way to look at the diffs?

knordback avatar May 26 '23 21:05 knordback

Most of these changes are small increases in the number of rows which seems pretty reasonable. Changes that sounds out as maybe funny:

  • Nearly 2% increase in the number of records in the fuel receipts and costs table (frc_eia923)
  • Nearly 2% increase in the number of utility records (utils_eia860)
  • Loss of almost 1% of the records in the generation table (gen_eia923) and the tables that depend on it directly.

@cmgosnell do you have any thoughts on why these tables would have such big changes just because we're not dropping "extra" columns before harvesting? The utilities make some sense, but the FRC and generation seem surprising.

To diff the tables, there's sqldiff but it looks like that may only be a Windows utility. I think @rousik or @zschira were doing some DB table diffing, maybe with another utility?

I would probably read the two tables into pandas and set the index to the primary key and see what records exist in one table but not the other by taking the difference between the two indexes. But that won't work well on the FRC table since it has no natural primary key.

zaneselvans avatar May 29 '23 15:05 zaneselvans

@knordback it looks like @rousik has been doing a bunch of work diffing SQL tables and this might be another good test of his little toolkit too.

zaneselvans avatar Jun 02 '23 00:06 zaneselvans

@knordback it looks like @rousik has been doing a bunch of work diffing SQL tables and this might be another good test of his little toolkit too.

Okay, that would be good. I get a lot of output just running sqldiff, and I don't know how to extract the essential difference.

knordback avatar Jun 02 '23 03:06 knordback

Do you have the outputs on gcs, or can you tell me how to repro the etl run in question? I could test my scripts against this case.

On Fri, Jun 2, 2023, 05:06 knordback @.***> wrote:

@knordback https://github.com/knordback it looks like @rousik https://github.com/rousik has been doing a bunch of work diffing SQL tables and this might be another good test of his little toolkit too.

Okay, that would be good. I get a lot of output just running sqldiff, and I don't know how to extract the essential difference.

— Reply to this email directly, view it on GitHub https://github.com/catalyst-cooperative/pudl/pull/2333#issuecomment-1573072540, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABYSCGGHM45YV6HFPBLDI5DXJFKC5ANCNFSM6AAAAAAVI5CXOE . You are receiving this because you were mentioned.Message ID: @.***>

rousik avatar Jun 02 '23 14:06 rousik

Do you have the outputs on gcs, or can you tell me how to repro the etl run in question? I could test my scripts against this case. On Fri, Jun 2, 2023, 05:06 knordback @.> wrote: @knordback https://github.com/knordback it looks like @rousik https://github.com/rousik has been doing a bunch of work diffing SQL tables and this might be another good test of his little toolkit too. Okay, that would be good. I get a lot of output just running sqldiff, and I don't know how to extract the essential difference. — Reply to this email directly, view it on GitHub <#2333 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABYSCGGHM45YV6HFPBLDI5DXJFKC5ANCNFSM6AAAAAAVI5CXOE . You are receiving this because you were mentioned.Message ID: @.>

I'm wanting to compare the pudl.sqlite produced on dev against that produced in the bug-509 branch. (I just merged from dev to the branch, so the diffs should only correspond to the branch changes.) I also have both locally and could put them on Google. Is there a particular place this kind of thing should go?

knordback avatar Jun 02 '23 15:06 knordback

All right, I have ran bug-509 branch vs dev and here's the output diff report:

RowCount(pudl.sqlite/denorm_boiler_fuel_eia923): 108
RowCount(pudl.sqlite/denorm_boiler_fuel_monthly_eia923): 108
RowCount(pudl.sqlite/denorm_boiler_fuel_yearly_eia923): 9
RowCount(pudl.sqlite/denorm_fuel_receipts_costs_eia923): 1390
RowCount(pudl.sqlite/denorm_fuel_receipts_costs_monthly_eia923): 248
RowCount(pudl.sqlite/denorm_fuel_receipts_costs_yearly_eia923): 32
RowCount(pudl.sqlite/denorm_generation_eia923): 24
RowCount(pudl.sqlite/denorm_generation_fuel_combined_eia923): 1300
RowCount(pudl.sqlite/denorm_generation_fuel_combined_monthly_eia923): 1300
RowCount(pudl.sqlite/denorm_generation_fuel_combined_yearly_eia923): 117
RowCount(pudl.sqlite/denorm_generation_monthly_eia923): 24
RowCount(pudl.sqlite/denorm_generation_yearly_eia923): 2
RowCount(pudl.sqlite/denorm_plants_utilities_eia): 65
RowCount(pudl.sqlite/denorm_utilities_eia): 7
RowCount(pudl.sqlite/utilities_eia860): 7
RowCount(pudl.sqlite/utilities_entity_eia): 1

The positive number means that that many rows were added in the branch bug-509 (right side of the comparison).

The relevant piece of code you can use to fetch the differing rows would be something along the lines of:

def read_table_as_df(db_path, table_name):
  con = create_engine(db_path)
  return pd.concat([
    df for df in pd.read_sql_table(table_name, con, chunksize=100_000
  ])

df_left = read_table_as_df(left_db, table) 
df_right = read_table_as_df(right_db, table)

df_merge = df_left.merge(df_right, how="outer", indicator=True)
# Then, df_merge["_merge"] has values "left_only", "right_only" or "both" depending on where it occurs

rousik avatar Jun 02 '23 18:06 rousik

Ah, this is super helpful. More rows in bug-509 is what we would expect. Which version of the ETL did you run to get this?

knordback avatar Jun 02 '23 18:06 knordback

Latest dev available at the time and latest commit of bug-509. rousik-output-diff branch has a tool "output_diff" you could point at two directories and it will produce the report for you. I'll be working on automating this.

rousik avatar Jun 05 '23 08:06 rousik

Huh, those differences in the number of rows seem pretty different from what @knordback got when he ran the minmax_rows tests. I wonder why?

zaneselvans avatar Jun 05 '23 15:06 zaneselvans

It depends on the ETL version used to create the pudl.sqlite, right? So I'm running modified and reference versions of etl_full_no_cems and will try @rousik 's tool on the output.

knordback avatar Jun 05 '23 16:06 knordback

If you've got two different DBs side by side and you want to see differences between the outputs they generate you can do something like the following in a Jupyter Notebook (here looking at the plant_in_service_ferc1 output table, which should have some differences between your DB and the last successful nightly build output from dev:

import pandas as pd
import sqlalchemy as sa

from pudl.metadata.classes import Resource
from pudl.output.pudltabl import PudlTabl

pk_cols = Resource.from_id("plant_in_service_ferc1").schema.primary_key

# Use the real paths to your 2 DBs obvs:
left_db_url = "sqlite:///" + "/Users/zane/code/catalyst/pudl-work/output/pudl.sqlite"
left_engine = sa.create_engine(left_db_url)
left_pudl_out = PudlTabl(left_engine)
left_df = pudl_out.plant_in_service_ferc1().set_index(pk_cols)

# Use the real paths to your 2 DBs obvs:
right_db_url = "sqlite:///" + "/Users/zane/code/catalyst/pudl-work/output/pudl.sqlite"
right_engine = sa.create_engine(right_db_url)
right_pudl_out = PudlTabl(right_engine)
right_df = pudl_out.plant_in_service_ferc1().set_index(pk_cols)

left_only_index = left_df.index.difference(right_df.index)
right_only_index = right_df.index.difference(left_df.index)

left_only_df = left_df.loc[left_only_index]
right_only_df = right_df.loc[right_only_index]

zaneselvans avatar Jun 10 '23 02:06 zaneselvans

Remarkably, I finally got back to this. I ran the bug-509 code and compared output against dev. To compare, I first ran Jan's output_diff tool to find pudl.sqlite tables that differ. It produced the following:

2023-06-25 17:37:04 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/advanced_metering_infrastructure_eia861.
2023-06-25 17:37:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/alembic_version.
2023-06-25 17:37:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/averaging_periods_eia.
2023-06-25 17:37:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/balance_sheet_assets_ferc1.
2023-06-25 17:37:06 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/balance_sheet_liabilities_ferc1.
2023-06-25 17:37:07 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/balancing_authorities_eia.
2023-06-25 17:37:07 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/balancing_authority_assn_eia861.
2023-06-25 17:37:07 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/balancing_authority_eia861.
2023-06-25 17:37:07 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_cooling_assn_eia860.
2023-06-25 17:37:07 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_emissions_control_equipment_assn_eia860.
2023-06-25 17:37:07 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_fuel_eia923.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_generator_assn_eia860.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_generator_assn_types_eia.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_stack_flue_assn_eia860.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_status_eia.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_types_eia.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boilers_eia860.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boilers_entity_eia.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/cash_flow_ferc1.
2023-06-25 17:37:12 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/coalmine_eia923.
2023-06-25 17:37:12 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/coalmine_types_eia.
2023-06-25 17:37:12 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/contract_types_eia.
2023-06-25 17:37:12 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/data_maturities.
2023-06-25 17:37:12 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/datasources.
2023-06-25 17:37:12 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/demand_hourly_pa_ferc714.
2023-06-25 17:37:49 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/demand_response_eia861.
2023-06-25 17:37:49 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/demand_response_water_heater_eia861.
2023-06-25 17:37:49 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/demand_side_management_ee_dr_eia861.
2023-06-25 17:37:49 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/demand_side_management_misc_eia861.
2023-06-25 17:37:49 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/demand_side_management_sales_eia861.
2023-06-25 17:37:49 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_balance_sheet_assets_ferc1.
2023-06-25 17:37:50 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_balance_sheet_liabilities_ferc1.
2023-06-25 17:37:51 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_boiler_fuel_eia923.
RowCount(pudl.sqlite/denorm_boiler_fuel_eia923): 420
2023-06-25 17:37:54 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_boiler_fuel_monthly_eia923.
RowCount(pudl.sqlite/denorm_boiler_fuel_monthly_eia923): 420
2023-06-25 17:37:57 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_boiler_fuel_yearly_eia923.
RowCount(pudl.sqlite/denorm_boiler_fuel_yearly_eia923): 35
2023-06-25 17:37:58 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_boilers_eia.
2023-06-25 17:37:58 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_cash_flow_ferc1.
2023-06-25 17:37:59 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_depreciation_amortization_summary_ferc1.
2023-06-25 17:38:00 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electric_energy_dispositions_ferc1.
2023-06-25 17:38:00 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electric_energy_sources_ferc1.
2023-06-25 17:38:00 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electric_operating_expenses_ferc1.
2023-06-25 17:38:02 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electric_operating_revenues_ferc1.
2023-06-25 17:38:03 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electric_plant_depreciation_changes_ferc1.
2023-06-25 17:38:04 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electric_plant_depreciation_functional_ferc1.
2023-06-25 17:38:04 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electricity_sales_by_rate_schedule_ferc1.
2023-06-25 17:38:04 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_emissions_control_equipment_eia860.
2023-06-25 17:38:04 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_fuel_by_plant_ferc1.
2023-06-25 17:38:04 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_fuel_ferc1.
2023-06-25 17:38:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_fuel_receipts_costs_eia923.
RowCount(pudl.sqlite/denorm_fuel_receipts_costs_eia923): 11494
2023-06-25 17:38:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_fuel_receipts_costs_monthly_eia923.
RowCount(pudl.sqlite/denorm_fuel_receipts_costs_monthly_eia923): 1953
2023-06-25 17:38:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_fuel_receipts_costs_yearly_eia923.
RowCount(pudl.sqlite/denorm_fuel_receipts_costs_yearly_eia923): 220
2023-06-25 17:38:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generation_eia923.
RowCount(pudl.sqlite/denorm_generation_eia923): 264
2023-06-25 17:38:06 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generation_fuel_combined_eia923.
RowCount(pudl.sqlite/denorm_generation_fuel_combined_eia923): 2794
2023-06-25 17:38:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generation_fuel_combined_monthly_eia923.
RowCount(pudl.sqlite/denorm_generation_fuel_combined_monthly_eia923): 2794
2023-06-25 17:38:13 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generation_fuel_combined_yearly_eia923.
RowCount(pudl.sqlite/denorm_generation_fuel_combined_yearly_eia923): 192
2023-06-25 17:38:13 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generation_monthly_eia923.
RowCount(pudl.sqlite/denorm_generation_monthly_eia923): 264
2023-06-25 17:38:15 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generation_yearly_eia923.
RowCount(pudl.sqlite/denorm_generation_yearly_eia923): 22
2023-06-25 17:38:15 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generators_eia.
2023-06-25 17:38:15 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_income_statement_ferc1.
2023-06-25 17:38:17 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_other_regulatory_liabilities_ferc1.
2023-06-25 17:38:17 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_ownership_eia860.
2023-06-25 17:38:17 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plant_in_service_ferc1.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_all_ferc1.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_eia.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_hydro_ferc1.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_pumped_storage_ferc1.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_small_ferc1.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_steam_ferc1.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_utilities_eia.
RowCount(pudl.sqlite/denorm_plants_utilities_eia): 83
2023-06-25 17:38:20 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_utilities_ferc1.
2023-06-25 17:38:20 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_purchased_power_ferc1.
2023-06-25 17:38:20 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_retained_earnings_ferc1.
2023-06-25 17:38:20 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_transmission_statistics_ferc1.
2023-06-25 17:38:21 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_utilities_eia.
RowCount(pudl.sqlite/denorm_utilities_eia): 2076
2023-06-25 17:38:21 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_utility_plant_summary_ferc1.
2023-06-25 17:38:22 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/depreciation_amortization_summary_ferc1.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/distributed_generation_fuel_eia861.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/distributed_generation_misc_eia861.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/distributed_generation_tech_eia861.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/distribution_systems_eia861.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/dynamic_pricing_eia861.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electric_energy_dispositions_ferc1.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electric_energy_sources_ferc1.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electric_operating_expenses_ferc1.
2023-06-25 17:38:25 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electric_operating_revenues_ferc1.
2023-06-25 17:38:25 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electric_plant_depreciation_changes_ferc1.
2023-06-25 17:38:26 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electric_plant_depreciation_functional_ferc1.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electricity_sales_by_rate_schedule_ferc1.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/emissions_control_equipment_eia860.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/energy_efficiency_eia861.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/energy_sources_eia.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/environmental_equipment_manufacturers_eia.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/epacamd_eia.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/epacamd_eia_subplant_ids.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/ferc_accounts.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/firing_types_eia.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/fuel_ferc1.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/fuel_receipts_costs_aggs_eia.
2023-06-25 17:38:28 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/fuel_receipts_costs_eia923.
2023-06-25 17:38:28 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/fuel_transportation_modes_eia.
2023-06-25 17:38:28 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/fuel_types_aer_eia.
2023-06-25 17:38:28 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/generation_eia923.
2023-06-25 17:38:29 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/generation_fuel_eia923.
2023-06-25 17:38:35 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/generation_fuel_nuclear_eia923.
2023-06-25 17:38:35 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/generators_eia860.
2023-06-25 17:38:36 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/generators_entity_eia.
2023-06-25 17:38:36 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/green_pricing_eia861.
2023-06-25 17:38:36 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/income_statement_ferc1.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/mercury_compliance_strategies_eia.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/mergers_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/momentary_interruptions_eia.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/net_metering_customer_fuel_class_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/net_metering_misc_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/non_net_metering_customer_fuel_class_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/non_net_metering_misc_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/nox_compliance_strategies_eia.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/nox_control_status_eia.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/nox_units_eia.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/operational_data_misc_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/operational_data_revenue_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/operational_status_eia.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/other_regulatory_liabilities_ferc1.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/ownership_eia860.
2023-06-25 17:38:39 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/particulate_compliance_strategies_eia.
2023-06-25 17:38:39 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/particulate_units_eia.
2023-06-25 17:38:39 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plant_in_service_ferc1.
2023-06-25 17:38:40 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_eia.
2023-06-25 17:38:40 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_eia860.
2023-06-25 17:38:40 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_entity_eia.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_hydro_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_pudl.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_pumped_storage_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_small_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_steam_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/political_subdivisions.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/power_purchase_types_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/prime_movers_eia.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/purchased_power_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/regulations_eia.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/reliability_eia861.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/reporting_frequencies_eia.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/respondent_id_ferc714.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/retained_earnings_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/sales_eia861.
2023-06-25 17:38:43 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/sector_consolidated_eia.
2023-06-25 17:38:43 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/service_territory_eia861.
2023-06-25 17:38:43 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/so2_compliance_strategies_eia.
2023-06-25 17:38:43 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/so2_units_eia.
2023-06-25 17:38:43 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/steam_plant_types_eia.
2023-06-25 17:38:43 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/transmission_statistics_ferc1.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_eia.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_eia860.
RowCount(pudl.sqlite/utilities_eia860): 2076
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_entity_eia.
RowCount(pudl.sqlite/utilities_entity_eia): 1
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_ferc1.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_ferc1_dbf.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_ferc1_xbrl.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_pudl.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utility_assn_eia861.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utility_data_misc_eia861.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utility_data_nerc_eia861.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utility_data_rto_eia861.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utility_plant_assn.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utility_plant_summary_ferc1.
2023-06-25 17:38:45 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/wet_dry_bottom_eia.

(It also found differences in ferc2.sqlite, but I think that those occur because resource constraints on my machine meant that not everything could run.)

Anyhow, in the output above I chose table differences that I felt pretty confident I could identify in PudlTabl. I started with denorm_plants_utlities_eia, which appears in PudlTabl and apparently is accessed via pu_eia860(), and put them into the Jupyter code above. I then get the output below. First, the right-versus-left difference in rows is 83, which is the same thing output_diff gave, so that much is a good sign. But it means there are rows that appear in the bug-509 output that don't appear in dev. That is not what I would expect. Second, some of the rows have changed. E.g., the first row of left_only_df corresponds to the second row of right_only_df, with the same plant IDs and plant name. But the utility ID changes. Since utility_id_eia is one of the fields that's no longer dropped, it makes sense that this would change. But I would have expected it to take on some null value in dev code, which doesn't seem to be the case. Moreover, the utlity_name_eia changes (from American Electric Power Co Inc to Public Service Co of Oklahoma). A quick search indicates that the former is correct: https://en.wikipedia.org/wiki/Oklaunion_Power_Plant

So this is all worrisome and it appears maybe something is going wrong. It would be good to get a quick check of this to see if maybe this is better than it appears to me, of if I need to investigate more deeply. image

knordback avatar Jun 26 '23 18:06 knordback

I'm not sure this is necessarily concerning. You're comparing the left_only and right_only dataframes which must, by definition, not match in some way. I don't think that the ordering is necessarily deterministic, unless you're sorting by index / value at some point.

I think it makes sense to start with one of the "base" tables rather than derived tables that are several steps downstream, since the tables that are the immediate outputs from the harvesting are going to be the first places that the consequences of the new process appear, and everything downstream will be a result of those changes.

The outputs of the harvesting process are the entity and annual tables for plants, generators, boilers, and utilities:

  • plants_entity_eia
  • plants_eia860
  • generators_entity_eia
  • generators_eia860
  • boilers_entity_eia
  • boilers_eia860
  • utilities_entity_eia
  • utilities_eia860

You might also want to pull the data directly from the database(s) rather than using PudlTabl just to make sure that there's nothing in that software layer that's causing a discrepancy.

zaneselvans avatar Jun 28 '23 18:06 zaneselvans

Here's a selection of the differences I'm seeing. This is not comprehensive, as some of the tables have many differences and I haven't looked at all of them.

  • plants_entity_eia:
    • mostly changes in plants_entity_eia: many from <NA> to something that looks valid; others seem like just different versions of the name
  • plants_eia860:
    • utility_id_eia value 15143 replaced with <NA>
    • nerc_region value SERC replaced with WECC
    • in general, lots and lots of changes to nerc_region, some of which seem like improvements, some which seem like regressions
  • generators_entity_eia:
    • no differences
  • generators_eia860:
    • seems to all be changes in utility_id_eia value from <NA> to something credible-looking
  • boilers_entity_eia:
    • no differences
  • boilers_eia860:
    • no differences
  • utilities_entity_eia:
    • one additional entry in the table, that looks generally credible
    • also lots of changes to utility_name_eia
  • utilities_eia860:
    • 2076 additional entries, all with what look like reasonable utility_id_eia and report_date values, and all other entries <NA>

The following files show the differences. (I should have named them better: *-left.txt come from dev; *-right.txt come from bug-509 branch.) These are generated with code that loads the created pudl.sqlite files, extracts the referenced tables, converts to two DataFrames, and then differences series-by-series and outputs as text the differences. The diffs are viewable by comparing foo-left.txt and foo-right.txt using a visual differencer (tkdiff, meld, etc.) utilities_eia860-right.txt utilities_eia860-left.txt utilities_entity_eia-right.txt utilities_entity_eia-left.txt generators_eia860-right.txt generators_eia860-left.txt plants_entity_eia-right.txt plants_entity_eia-left.txt plants_eia860-right.txt plants_eia860-left.txt

knordback avatar Jul 11 '23 17:07 knordback