WIP: Publish raw FERC XBRL DBs to Datasette
This PR follows issue #1830, adding the ability to publish all of the raw FERC XBRL-based SQLite databases to Datasette.
Tasks
Create SQLite databases from PUDL
- [ ] Create new script for generating the raw SQLite databases directly from PUDL
- [ ] Update the datastore to work with all of the FERC forms. The existing datastore for FERC Form 1 XBRL data should work for this with some minor updates.
Ingest Metadata generated by extraction tool
- [ ] The FERC XBRL Extractor can generate a Frictionless Data Package using metadata extracted from the FERC taxonomy. This will allow us to publish each database with column-level descriptions provided by FERC (see the sketch below).
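For context, a minimal sketch of how those column-level descriptions could be pulled out of such a Frictionless datapackage.json (the file path below is hypothetical; the resources/schema/fields layout follows the Frictionless Data Package spec):

```python
import json
from pathlib import Path

# Hypothetical location of the descriptor written next to the SQLite DB.
descriptor_path = Path("sqlite/ferc1_xbrl_datapackage.json")

with descriptor_path.open() as f:
    datapackage = json.load(f)

# Each resource corresponds to a table; each field carries a description
# extracted from the FERC taxonomy.
for resource in datapackage["resources"]:
    for field in resource["schema"]["fields"]:
        print(resource["name"], field["name"], field.get("description", ""))
```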
Enable publication
- [ ] Integrate new sources with datasette_metadata_to_yml
- [ ] Update datasette publication bash script
I think rather than having two scripts for creating FERC SQLite databases (the current one for FERC 1, plus another for the other forms), it would make sense to simply create one ferc_to_sqlite script. This could also involve creating a single settings object called FercToSqliteSettings. That way, if you're running the ETL and only interested in FERC 1 data, you can specify that in the settings and the script will ignore the other forms. This also seems like a change that will make further integration of non-Form 1 data easier going forward.
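A rough sketch of what that combined settings object might look like, assuming Pydantic models like the ones PUDL already uses for ETL settings (all class and field names here are illustrative, not the actual PUDL API):

```python
from typing import List, Optional

from pydantic import BaseModel


class Ferc1DbfToSqliteSettings(BaseModel):
    """Illustrative settings for the existing FERC Form 1 DBF-derived DB."""

    years: List[int] = list(range(1994, 2021))


class FercXbrlToSqliteSettings(BaseModel):
    """Illustrative settings for one XBRL-derived FERC database."""

    form: str  # e.g. "ferc1", "ferc2", "ferc714"
    years: List[int] = [2021]


class FercToSqliteSettings(BaseModel):
    """One settings object covering every FERC SQLite output.

    Leaving a sub-settings attribute unset would tell the ferc_to_sqlite
    script to skip that form entirely.
    """

    ferc1_dbf: Optional[Ferc1DbfToSqliteSettings] = None
    ferc_xbrl: List[FercXbrlToSqliteSettings] = []


# e.g. convert only the FERC 1 DBF data and skip all XBRL forms:
settings = FercToSqliteSettings(ferc1_dbf=Ferc1DbfToSqliteSettings(years=[2020]))
```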
Codecov Report
Merging #1831 (5729e4a) into xbrl_integration (7fda070) will increase coverage by 0.3%. The diff coverage is 91.1%.
@@ Coverage Diff @@
## xbrl_integration #1831 +/- ##
==================================================
+ Coverage 83.3% 83.6% +0.3%
==================================================
Files 65 66 +1
Lines 7418 7527 +109
==================================================
+ Hits 6180 6295 +115
+ Misses 1238 1232 -6
Impacted Files | Coverage Δ |
---|---|
src/pudl/metadata/sources.py | 100.0% <ø> (ø) |
src/pudl/workspace/datastore.py | 69.6% <ø> (+1.5%) :arrow_up: |
src/pudl/convert/ferc_to_sqlite.py | 60.0% <12.5%> (ø) |
src/pudl/convert/datasette_metadata_to_yml.py | 68.4% <25.0%> (-31.6%) :arrow_down: |
src/pudl/extract/xbrl.py | 95.7% <95.7%> (ø) |
src/pudl/extract/ferc1.py | 86.7% <100.0%> (-0.8%) :arrow_down: |
src/pudl/metadata/classes.py | 82.4% <100.0%> (+0.2%) :arrow_up: |
src/pudl/settings.py | 96.1% <100.0%> (+0.9%) :arrow_up: |
src/pudl/workspace/setup.py | 83.1% <100.0%> (+3.1%) :arrow_up: |
... and 4 more |
Are none of the tasks listed in this issue done? Or are you just not checking off the boxes as you go?
Also, is the PR name correct? Is this really not to be merged until it gets all the way to being able to deploy the XBRL-derived SQLite DBs to Datasette? Should it be a draft PR instead?
What issues are associated with this PR? That seems like a lot of work, which could/should be encapsulated in at least several issues. Is it really just #1861?
@zschira it looks like the tests are failing due to a settings parsing issue.
@zschira The tests might be failing because the main() function in datasette_metadata_to_yml doesn't return the integer 0 (zero) when it succeeds. It looks like it returns None, and IIRC only sys.exit(0) is interpreted as success by the CLI tests (and by Unix CLI tools in general).
@zaneselvans the Python docs seem to indicate that passing None to sys.exit() will be treated the same as 0, so I'm not sure that's it, unfortunately.
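(A quick sanity check of that behavior, outside the test suite; this snippet is just illustrative and not part of the PR:)

```python
import subprocess
import sys

# Per the Python docs, sys.exit(None) is equivalent to sys.exit(0), so a
# main() that returns None still produces exit status 0 when wrapped as
# sys.exit(main()).
for arg in ("None", "0"):
    result = subprocess.run([sys.executable, "-c", f"import sys; sys.exit({arg})"])
    print(arg, "->", result.returncode)  # prints 0 for both
```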
It also seems like the datasette_metadata_to_yml script is now the first test to trigger the PUDL ETL, by depending indirectly on pudl_engine somehow. Maybe there was an implicit dependency before that worked because of the order things were running in, but now it doesn't work because that other thing (like generating the FERC 1 DBF-derived DB?) doesn't happen until later on.
When I run the tests locally I'm able to reproduce the failure that seems to be showing up in the CI. I get this error output:
script_runner = <ScriptRunner inprocess>
pudl_settings_fixture = {'censusdp1tract_db': 'sqlite:////tmp/pytest-of-zane/pytest-11/pudl0/sqlite/censusdp1tract.sqlite', 'data_dir': '/home...0/sqlite/ferc1.sqlite', 'ferc1_xbrl_db': 'sqlite:////tmp/pytest-of-zane/pytest-11/pudl0/sqlite/ferc1_xbrl.sqlite', ...}
ferc1_xbrl_engine = Engine(sqlite:////tmp/pytest-of-zane/pytest-11/pudl0/sqlite/ferc1_xbrl.sqlite)
@pytest.mark.script_launch_mode("inprocess")
def test_datasette_metadata_script(
script_runner, pudl_settings_fixture, ferc1_xbrl_engine
):
"""Run datasette_metadata_to_yml for testing."""
metadata_yml = Path(pudl_settings_fixture["pudl_out"], "metadata.yml")
logger.info(f"Writing Datasette Metadata to {metadata_yml}")
ret = script_runner.run(
"datasette_metadata_to_yml",
"-o",
str(metadata_yml),
print_result=False,
)
> assert ret.success
E assert False
E + where False = <pytest_console_scripts.RunResult object at 0x7f6eac335840>.success
I wondered whether running the script at the command line would give a different result, so I naively just ran it like:
datasette_metadata_to_yml -o dude.yml
and unsurprisingly I got an error about there being no datapackage.json file where the script looked:
Traceback (most recent call last):
File "/home/zane/mambaforge/envs/pudl-dev/bin/datasette_metadata_to_yml", line 33, in <module>
sys.exit(load_entry_point('catalystcoop.pudl', 'console_scripts', 'datasette_metadata_to_yml')())
File "/home/zane/code/catalyst/pudl/src/pudl/convert/datasette_metadata_to_yml.py", line 46, in main
dm = DatasetteMetadata.from_data_source_ids(pudl_settings=pudl_settings)
File "/home/zane/code/catalyst/pudl/src/pudl/metadata/classes.py", line 1876, in from_data_source_ids
with open(pudl_settings[f"{xbrl_id}_descriptor"]) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/zane/code/catalyst/pudl-work/sqlite/ferc1_xbrl_descriptor.json'
[1] 643141 exit 1 datasette_metadata_to_yml -o dude.yml
So I decided to look in the test directories, and see if it had generated the expected datapackage.json files there.
Looking in /tmp/pytest-of-zane/pytest-current I saw the following links and directories, which seemed a little bit weird:
./ ../ pudl0/ pudlcurrent@ test_datasette_metadata_script0/ test_datasette_metadata_scriptcurrent@
Usually there are only directories of the form pudlN/ plus pudlcurrent@, which points at the directory with the biggest N, and everything that's been output by the PUDL tests is in there.
Note that there's nothing but empty directories under the test_datasette_metadata_script0 directory:
find ./test_datasette_metadata_script0
./test_datasette_metadata_script0
./test_datasette_metadata_script0/script-cwd
The other scripts that are tested in a similar way to this one, in the test/integration/console_scripts_test.py module, don't produce a directory like this in their outputs. Also, note that given the way that test module generates its list of tests to run, it should already attempt to run datasette_metadata_to_yml --help (since it just iterates through all of the entrypoints defined in setup.py):
"""Test the PUDL console scripts from within PyTest."""
import pkg_resources
import pytest
# Obtain a list of all deployed entry point scripts to test:
PUDL_SCRIPTS = [
ep.name
for ep in pkg_resources.iter_entry_points("console_scripts")
if ep.module_name.startswith("pudl")
]
@pytest.mark.parametrize("script_name", PUDL_SCRIPTS)
@pytest.mark.script_launch_mode("inprocess")
def test_pudl_scripts(script_runner, script_name):
"""Run each console script in --help mode for testing."""
ret = script_runner.run(script_name, "--help", print_result=False)
assert ret.success
Those tests seem to run just fine and all pass:
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-censusdp1tract_to_sqlite] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-datasette_metadata_to_yml] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-epacems_to_parquet] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-ferc_to_sqlite] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-metadata_to_rst] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-pudl_datastore] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-pudl_etl] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-pudl_setup] PASSED
test/integration/console_scripts_test.py::test_pudl_scripts[inprocess-state_demand] PASSED
And they do not seem to result in any script-specific directories being created.
The very next test that runs is the standalone datasette_metadata_test:
test/integration/datasette_metadata_test.py::test_datasette_metadata_script[inprocess]
And it does create that additional script-specific output directory under /tmp/pytest-of-zane/... that's separate from where the rest of the pudl test outputs are going.
From the pytest-console-scripts documentation, it looks like there's an argument to script_runner.run() (cwd) that lets you set the current working directory of the script when it's run. It doesn't look like either of us is passing that argument, but in the iterated script tests we're really just testing whether the scripts import everything, have no syntax errors, etc., so it doesn't matter where they run. The new test, in contrast, is actually trying to write a file and access other files that have previously been output by the ETL process, so it does need to know where it is in the filesystem, or at least where to find the datapackage.json files that were output alongside the SQLite files.
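Just for reference, a minimal sketch of what passing that argument could look like (purely illustrative: cwd is the keyword described in the pytest-console-scripts docs, and whether it would actually help here is doubtful for the reason below):

```python
# Illustrative only: run the script with its working directory pointed at the
# pytest-managed PUDL output dir (path taken from the existing fixture).
ret = script_runner.run(
    "datasette_metadata_to_yml",
    "-o",
    str(metadata_yml),
    cwd=pudl_settings_fixture["pudl_out"],  # hypothetical; not in the current test
    print_result=False,
)
assert ret.success
```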
However, because you're calling the script directly at the command line (virtually, inside the tests) rather than calling the function or method you want to test and passing in the test-environment-specific pudl_settings dictionary (which has the paths to the temporary directories created by pytest), the script generates its own incorrect set of default paths, uses those to look for the datapackage.json next to the SQLite file, finds no such file, and fails.
Switching the script-style test out for one that uses the metadata generation code directly:
def test_datasette_metadata_to_yml(pudl_settings_fixture, ferc1_xbrl_engine):
"""Run datasette_metadata_to_yml for testing."""
metadata_yml = Path(pudl_settings_fixture["pudl_out"], "metadata.yml")
logger.info(f"Writing Datasette Metadata to {metadata_yml}")
dm = DatasetteMetadata.from_data_source_ids(pudl_settings=pudl_settings_fixture)
dm.to_yaml(path=metadata_yml)
logger.info("Parsing generated metadata using datasette utils.")
metadata_json = json.dumps(yaml.safe_load(metadata_yml.open()))
parsed_metadata = datasette.utils.parse_metadata(metadata_json)
assert set(parsed_metadata["databases"]) == {"pudl", "ferc1"}
assert parsed_metadata["license"] == "CC-BY-4.0"
assert (
parsed_metadata["databases"]["pudl"]["source_url"]
== "https://github.com/catalyst-cooperative/pudl"
)
assert (
parsed_metadata["databases"]["pudl"]["tables"]["plants_entity_eia"][
"label_column"
]
== "plant_name_eia"
)
for tbl_name in parsed_metadata["databases"]["pudl"]["tables"]:
assert (
parsed_metadata["databases"]["pudl"]["tables"][tbl_name]["columns"]
is not None
)
It gets to the point of trying to construct the DatasetteMetadata instance and fails with several hundred Pydantic validation errors. It seems like it's working with real data though, and getting some types it doesn't expect (like duration) and some extra fields in the resource definitions that are prohibited by the Pydantic model. A sampling:
E extra fields not permitted (type=value_error.extra)
E resources -> 306 -> schema -> fields -> 9 -> type
E unexpected value; permitted: 'string', 'number', 'integer', 'boolean', 'date', 'datetime', 'year' (type=value_error.const; given=duration; permitted=('string', 'number', 'integer', 'boolean', 'date', 'datetime', 'year'))
E resources -> 306 -> schema -> fields -> 12 -> type
E unexpected value; permitted: 'string', 'number', 'integer', 'boolean', 'date', 'datetime', 'year' (type=value_error.const; given=duration; permitted=('string', 'number', 'integer', 'boolean', 'date', 'datetime', 'year'))
E resources -> 306 -> encoding
E extra fields not permitted (type=value_error.extra)
E resources -> 306 -> sqlite
E extra fields not permitted (type=value_error.extra)
E resources -> 307 -> encoding
E extra fields not permitted (type=value_error.extra)
E resources -> 307 -> sqlite
And the final pytest error:
FAILED test/integration/datasette_metadata_test.py::test_datasette_metadata_to_yml - pydantic.error_wrappers.ValidationError: 734 validation errors for Package
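For context, errors of this shape are what Pydantic (v1) raises when a model forbids extra fields and constrains a field to a fixed set of values; a minimal illustration, with made-up class and field names rather than PUDL's actual datapackage model:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class FieldSchema(BaseModel):
    """Made-up stand-in for a datapackage field model."""

    # Only these type strings are accepted; "duration" is not among them,
    # which yields the value_error.const messages above.
    type: Literal["string", "number", "integer", "boolean", "date", "datetime", "year"]

    class Config:
        extra = "forbid"  # unknown keys raise "extra fields not permitted"


try:
    FieldSchema(type="duration", encoding="utf-8")
except ValidationError as err:
    print(err)  # reports both the bad "type" value and the extra "encoding" key
```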
@zaneselvans this is very helpful, thanks! Those validation errors are showing up because I changed the ferc-xbrl-extractor dependency to point at a git branch to test something, but I've changed it back now, and I believe everything should be fixed.
🎉