WIP: Publish raw FERC XBRL DBs to Datasette
This PR follows issue #1830, adding the ability to publish all of the raw FERC XBRL-based SQLite databases to Datasette.
Tasks
Create SQLite databases from PUDL
- [ ] Create new script for generating the raw SQLite databases directly from PUDL
- [ ] Update the datastore to work with all of the FERC forms. The existing datastore for FERC Form 1 XBRL data should work for this with some minor updates.
Ingest Metadata generated by extraction tool
- [ ] The FERC XBRL Extractor can generate a Frictionless Data Package using metadata extracted from the FERC taxonomy. This will allow us to publish each database with column-level descriptions provided by FERC (see the sketch below).
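For context, a minimal sketch of how those column-level descriptions could be pulled out of such a Frictionless datapackage.json (the file path below is hypothetical; the resources/schema/fields layout follows the Frictionless Data Package spec):

```python
import json
from pathlib import Path

# Hypothetical location of the descriptor written next to the SQLite DB.
descriptor_path = Path("sqlite/ferc1_xbrl_datapackage.json")

with descriptor_path.open() as f:
    datapackage = json.load(f)

# Each resource corresponds to a table; each field carries a description
# extracted from the FERC taxonomy.
for resource in datapackage["resources"]:
    for field in resource["schema"]["fields"]:
        print(resource["name"], field["name"], field.get("description", ""))
```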
Enable publication
- [ ] Integrate new sources with datasette_metadata_to_yml
- [ ] Update datasette publication bash script
I think rather than having two scripts for creating FERC SQLite databases (the current one for FERC 1, plus another for the other forms), it would make sense to simply create one ferc_to_sqlite script. This could also involve creating a single settings object called FercToSqliteSettings. That way, if you're running the ETL and only interested in FERC 1 data, you can specify that in the settings and the script will ignore the other forms. This also seems like a change that will make further integration of non-Form 1 data easier going forward.
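A rough sketch of what that combined settings object might look like, assuming Pydantic models like the ones PUDL already uses for ETL settings (all class and field names here are illustrative, not the actual PUDL API):

```python
from typing import List, Optional

from pydantic import BaseModel


class Ferc1DbfToSqliteSettings(BaseModel):
    """Illustrative settings for the existing FERC Form 1 DBF-derived DB."""

    years: List[int] = list(range(1994, 2021))


class FercXbrlToSqliteSettings(BaseModel):
    """Illustrative settings for one XBRL-derived FERC database."""

    form: str  # e.g. "ferc1", "ferc2", "ferc714"
    years: List[int] = [2021]


class FercToSqliteSettings(BaseModel):
    """One settings object covering every FERC SQLite output.

    Leaving a sub-settings attribute unset would tell the ferc_to_sqlite
    script to skip that form entirely.
    """

    ferc1_dbf: Optional[Ferc1DbfToSqliteSettings] = None
    ferc_xbrl: List[FercXbrlToSqliteSettings] = []


# e.g. convert only the FERC 1 DBF data and skip all XBRL forms:
settings = FercToSqliteSettings(ferc1_dbf=Ferc1DbfToSqliteSettings(years=[2020]))
```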
Codecov Report
Merging #1831 (5729e4a) into xbrl_integration (7fda070) will increase coverage by 0.3%. The diff coverage is 91.1%.
@@ Coverage Diff @@
## xbrl_integration #1831 +/- ##
==================================================
+ Coverage 83.3% 83.6% +0.3%
==================================================
Files 65 66 +1
Lines 7418 7527 +109
==================================================
+ Hits 6180 6295 +115
+ Misses 1238 1232 -6
Impacted Files | Coverage Δ |
---|---|
src/pudl/metadata/sources.py | 100.0% <ø> (ø) |
src/pudl/workspace/datastore.py | 69.6% <ø> (+1.5%) :arrow_up: |
src/pudl/convert/ferc_to_sqlite.py | 60.0% <12.5%> (ø) |
src/pudl/convert/datasette_metadata_to_yml.py | 68.4% <25.0%> (-31.6%) :arrow_down: |
src/pudl/extract/xbrl.py | 95.7% <95.7%> (ø) |
src/pudl/extract/ferc1.py | 86.7% <100.0%> (-0.8%) :arrow_down: |
src/pudl/metadata/classes.py | 82.4% <100.0%> (+0.2%) :arrow_up: |
src/pudl/settings.py | 96.1% <100.0%> (+0.9%) :arrow_up: |
src/pudl/workspace/setup.py | 83.1% <100.0%> (+3.1%) :arrow_up: |
... and 4 more |
Are none of the tasks listed in this issue done? Or are you just not checking off the boxes as you go?
Also, is the PR name correct? Is this really not to be merged until it gets all the way to being able to deploy the XBRL-derived SQLite DBs to Datasette? Should it be a draft PR instead?
What issues are associated with this PR? That seems like a lot of work, which could/should be encapsulated in at least several issues. Is it really just #1861?
@zschira it looks like the tests are failing due to a settings parsing issue.
@zschira The tests might be failing because the main() function in datasette_metadata_to_yml doesn't return the integer 0 (zero) when it succeeds. It looks like it returns None, and IIRC only sys.exit(0) is interpreted as success by the CLI tests (and by Unix CLI tools in general).
@zaneselvans the Python docs seem to indicate that passing None to sys.exit() will be treated the same as 0, so I'm not sure that's it, unfortunately.
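(A quick sanity check of that behavior, outside the test suite; this snippet is just illustrative and not part of the PR:)

```python
import subprocess
import sys

# Per the Python docs, sys.exit(None) is equivalent to sys.exit(0), so a
# main() that returns None still produces exit status 0 when wrapped as
# sys.exit(main()).
for arg in ("None", "0"):
    result = subprocess.run([sys.executable, "-c", f"import sys; sys.exit({arg})"])
    print(arg, "->", result.returncode)  # prints 0 for both
```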
It also seems like the datasette_metadata_to_yml script is now the first test to trigger the PUDL ETL, by depending indirectly on pudl_engine somehow. Maybe there was an implicit dependency before that worked because of the order things were running in, but now it doesn't work because that other thing (like generating the FERC 1 DBF-derived DB?) doesn't happen until later on.
When I run the tests locally I'm able to reproduce the failure that seems to be showing up in the CI. I get this error output:
script_runner = <ScriptRunner inprocess>
pudl_settings_fixture = {'censusdp1tract_db': 'sqlite:////tmp/pytest-of-zane/pytest-11/pudl0/sqlite/censusdp1tract.sqlite', 'data_dir': '/home...0/sqlite/ferc1.sqlite', 'ferc1_xbrl_db': 'sqlite:////tmp/pytest-of-zane/pytest-11/pudl0/sqlite/ferc1_xbrl.sqlite', ...}
ferc1_xbrl_engine = Engine(sqlite:////tmp/pytest-of-zane/pytest-11/pudl0/sqlite/ferc1_xbrl.sqlite)
@pytest.mark.script_launch_mode("inprocess")
def test_datasette_metadata_script(
script_runner, pudl_settings_fixture, ferc1_xbrl_engine
):
"""Run datasette_metadata_to_yml for testing."""
metadata_yml = Path(pudl_settings_fixture["pudl_out"], "metadata.yml")
logger.info(f"Writing Datasette Metadata to {metadata_yml}")
ret = script_runner.run(
"datasette_metadata_to_yml",
"-o",
str(metadata_yml),
print_result=False,
)
> assert ret.success
E assert False
E + where False = <pytest_console_scripts.RunResult object at 0x7f6eac335840>.success
I wondered whether running the script at the command line would give a different result, so I naively just ran it like:
datasette_metadata_to_yml -o dude.yml
and unsurprisingly I got an error about there being no datapackage.json file where the script looked:
Traceback (most recent call last):
File "/home/zane/mambaforge/envs/pudl-dev/bin/datasette_metadata_to_yml", line 33, in <module>
sys.exit(load_entry_point('catalystcoop.pudl', 'console_scripts', 'datasette_metadata_to_yml')())
File "/home/zane/code/catalyst/pudl/src/pudl/convert/datasette_metadata_to_yml.py", line 46, in main
dm = DatasetteMetadata.from_data_source_ids(pudl_settings=pudl_settings)
File "/home/zane/code/catalyst/pudl/src/pudl/metadata/classes.py", line 1876, in from_data_source_ids
with open(pudl_settings[f"{xbrl_id}_descriptor"]) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/zane/code/catalyst/pudl-work/sqlite/ferc1_xbrl_descriptor.json'
[1] 643141 exit 1 datasette_metadata_to_yml -o dude.yml
So I decided to look in the test directories, and see if it had generated the expected datapackage.json files there.
Looking in /tmp/pytest-of-zane/pytest-current I saw the following links and directories, which seemed a little bit weird:
./ ../ pudl0/ pudlcurrent@ test_datasette_metadata_script0/ test_datasette_metadata_scriptcurrent@
Usually there are only directories of the form pudlN/ plus pudlcurrent@, which points at the directory with the biggest N, and everything that's been output by the PUDL tests is in there.
Note that there's nothing but empty directories under the test_datasette_metadata_script0 directory:
find ./test_datasette_metadata_script0
./test_datasette_metadata_script0
./test_datasette_metadata_script0/script-cwd
The other scripts that are tested in a similar way to this one, in the test/integration/console_scripts_test.py module, don't produce a directory like this in their outputs. Also, note that given the way that test module generates its list of tests to run, it should already attempt to run datasette_metadata_to_yml --help (since it just iterates through all of the entrypoints defined in setup.py):
"""Test the PUDL console scripts from within PyTest."""
import pkg_resources
import pytest
# Obtain a list of all deployed entry point scripts to test:
PUDL_SCRIPTS = [
ep.name
for ep in pkg_resources.iter_entry_points("console_scripts")
if ep.module_name.startswith("pudl")
]
@pytest.mark.parametrize("script_name", PUDL_SCRIPTS)
@pytest.mark.script_launch_mode("inprocess")
def test_pudl_scripts(script_runner, script_name):
"""Run each console script in --help mode for testing."""
ret = script_runner.run(script_name, "--help", print_result=False)
assert ret.success
Those tests seem to run just fine and all pass:
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-censusdp1tract_to_sqlite] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-datasette_metadata_to_yml] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-epacems_to_parquet] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-ferc_to_sqlite] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-metadata_to_rst] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-pudl_datastore] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-pudl_etl] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-pudl_setup] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-state_demand] PASSED
And they do not seem to result in any script-specific directories being created.
The very next test that runs is the standalone datasette_metadata_test:
test/integration/datasette_metadata_test.py::test_datasette_metadata_script[inprocess]
And it does create that additional script-specific output directory under /tmp/pytest-of-zane/... that's separate from where the rest of the pudl test outputs are going.
From the pytest-console-scripts documentation, it looks like there's an argument to script_runner.run() (cwd) that lets you set the current working directory of the script when it's run. It doesn't look like either of us is passing that argument, but in the iterated script tests we're really just testing whether the scripts import everything, have no syntax errors, etc., so it doesn't matter where they run. The new test, in contrast, is actually trying to write a file and access other files that have previously been output by the ETL process, so it does need to know where it is in the filesystem, or at least where to find the datapackage.json files that were output alongside the SQLite files.
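Just for reference, a minimal sketch of what passing that argument could look like (purely illustrative: cwd is the keyword described in the pytest-console-scripts docs, and whether it would actually help here is doubtful for the reason below):

```python
# Illustrative only: run the script with its working directory pointed at the
# pytest-managed PUDL output dir (path taken from the existing fixture).
ret = script_runner.run(
    "datasette_metadata_to_yml",
    "-o",
    str(metadata_yml),
    cwd=pudl_settings_fixture["pudl_out"],  # hypothetical; not in the current test
    print_result=False,
)
assert ret.success
```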
However, because you're calling the script directly at the command line (virtually, inside the tests) rather than calling the function or method you want to test and passing in the test-environment-specific pudl_settings dictionary (which has the paths to the temporary directories created by pytest), the script generates its own incorrect set of default paths, uses those to look for the datapackage.json next to the SQLite file, finds no such file, and fails.
Switching the script-style test out for one that uses the metadata generation code directly:
def test_datasette_metadata_to_yml(pudl_settings_fixture, ferc1_xbrl_engine):
"""Run datasette_metadata_to_yml for testing."""
metadata_yml = Path(pudl_settings_fixture["pudl_out"], "metadata.yml")
logger.info(f"Writing Datasette Metadata to {metadata_yml}")
dm = DatasetteMetadata.from_data_source_ids(pudl_settings=pudl_settings_fixture)
dm.to_yaml(path=metadata_yml)
logger.info("Parsing generated metadata using datasette utils.")
metadata_json = json.dumps(yaml.safe_load(metadata_yml.open()))
parsed_metadata = datasette.utils.parse_metadata(metadata_json)
assert set(parsed_metadata["databases"]) == {"pudl", "ferc1"}
assert parsed_metadata["license"] == "CC-BY-4.0"
assert (
parsed_metadata["databases"]["pudl"]["source_url"]
== "https://github.com/catalyst-cooperative/pudl"
)
assert (
parsed_metadata["databases"]["pudl"]["tables"]["plants_entity_eia"][
"label_column"
]
== "plant_name_eia"
)
for tbl_name in parsed_metadata["databases"]["pudl"]["tables"]:
assert (
parsed_metadata["databases"]["pudl"]["tables"][tbl_name]["columns"]
is not None
)
It gets to the point of trying to construct the DatasetteMetadata instance and fails with several hundred Pydantic validation errors. It seems like it's working with real data though, and getting some types it doesn't expect (like duration) and some extra fields in the resource definitions that are prohibited by the Pydantic model. A sampling:
E extra fields not permitted (type=value_error.extra)
E resources -> 306 -> schema -> fields -> 9 -> type
E unexpected value; permitted: 'string', 'number', 'integer', 'boolean', 'date', 'datetime', 'year' (type=value_error.const; given=duration; permitted=('string', 'number', 'integer', 'boolean', 'date', 'datetime', 'year'))
E resources -> 306 -> schema -> fields -> 12 -> type
E unexpected value; permitted: 'string', 'number', 'integer', 'boolean', 'date', 'datetime', 'year' (type=value_error.const; given=duration; permitted=('string', 'number', 'integer', 'boolean', 'date', 'datetime', 'year'))
E resources -> 306 -> encoding
E extra fields not permitted (type=value_error.extra)
E resources -> 306 -> sqlite
E extra fields not permitted (type=value_error.extra)
E resources -> 307 -> encoding
E extra fields not permitted (type=value_error.extra)
E resources -> 307 -> sqlite
And the final pytest error:
FAILED test/integration/datasette_metadata_test.py::test_datasette_metadata_to_yml - pydantic.error_wrappers.ValidationError: 734 validation errors for Package
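For context, errors of this shape are what Pydantic (v1) raises when a model forbids extra fields and constrains a field to a fixed set of values; a minimal illustration, with made-up class and field names rather than PUDL's actual datapackage model:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class FieldSchema(BaseModel):
    """Made-up stand-in for a datapackage field model."""

    # Only these type strings are accepted; "duration" is not among them,
    # which yields the value_error.const messages above.
    type: Literal["string", "number", "integer", "boolean", "date", "datetime", "year"]

    class Config:
        extra = "forbid"  # unknown keys raise "extra fields not permitted"


try:
    FieldSchema(type="duration", encoding="utf-8")
except ValidationError as err:
    print(err)  # reports both the bad "type" value and the extra "encoding" key
```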
@zaneselvans this is very helpful, thanks! Those validation errors are showing up because I changed the ferc-xbrl-extractor dependency to point at a git branch to test something, but I've changed it back now, and I believe everything should be fixed.
🎉