Run ferc_to_sqlite and pudl_etl independent of integration tests
This reworks how tests (unit, integration) are run by GitHub Actions. In particular, it modifies how ci-integration is executed by running the `ferc_to_sqlite` and `pudl_etl` commands independently, making use of Dagster parallelism and larger runners.
I think the call to `pytest` is failing because you need to install some or all of the extra dependencies when you install the `catalystcoop.pudl` package -- right now you're just installing the bare minimum requirements. So:

```bash
pip install ./[doc,test,dev,datasette]
```

rather than just

```bash
pip install .
```
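(A side note not from the thread: in shells where square brackets are glob characters, zsh in particular, the extras spec usually needs quoting.)

```bash
# Quoting keeps the shell from treating the [...] extras list as a glob pattern.
pip install "./[doc,test,dev,datasette]"
```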
Now the tests are failing because they're looking for the databases in the wrong location -- in the temporary directory that they would have been written to if they'd been built with the fixtures, rather than the directories specified by `$PUDL_INPUT` and `$PUDL_OUTPUT`. Those env vars are being passed into the Tox environment though, so it seems as if there may be a problem in how the correct output directory is identified by the settings.

A separate issue here is that we're breaking the Python environment isolation that Tox provides by running `pudl_etl` outside of it and installing from the local directory, rather than from the package that Tox builds in the virtual environment that Tox creates. That means it might have access to files that appear in the repository but not in the package. Instead we could run the ETL inside Tox, like is done with `tox -e nuke`.
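(A debugging sketch, not something done in this PR: one way to narrow this down is to confirm that the paths exported outside Tox are the same ones the test run resolves.)

```bash
# Outside tox: the directories the workflow exported.
echo "PUDL_INPUT=$PUDL_INPUT PUDL_OUTPUT=$PUDL_OUTPUT"

# Inside tox: the integration tests should resolve the same directories,
# assuming PUDL_INPUT and PUDL_OUTPUT are listed in the tox passenv config.
tox -e integration -- --live-dbs
```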
Hmm. I tried running the following locally:

```bash
tox -e integration -- --live-dbs
```

and it didn't have any trouble finding the correct pre-existing DBs in my `$PUDL_OUTPUT` directory.
I'm still contemplating what would be the best approach here. I would like to reuse tox, but I would also like to be able to run the three steps (ferc_to_sqlite, pudl_etl, integration tests) independently, since AFAIK that allows better timing insights (how long each stage took) and gives you faster root-causing and cleaner logs (i.e. you know that the ETL or ferc_to_sqlite failed without having this intermixed with the integration test outputs/logs).

I suppose the missing bit here is telling tox that all three of these operations should run in the same tox environment, and that we'd like to cache this environment between GitHub Actions steps. Let me investigate some more.
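For illustration only -- the commands and settings-file arguments below are placeholders, not the exact invocations used in this PR's workflow -- the three-stage split looks roughly like:

```bash
# Stage 1: build the FERC SQLite/XBRL databases.
ferc_to_sqlite <settings-file>

# Stage 2: run the main PUDL ETL against the stage 1 outputs.
pudl_etl <settings-file>

# Stage 3: run the integration tests against the pre-built databases.
pytest test/integration --live-dbs
```

Running each stage as its own CI step is what gives the per-stage timing and keeps a failure's logs separate from the test output.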
I have switched this back to using tox to run the commands, and perhaps made it such that coverage is exported both for `ferc_to_sqlite` and `pudl_etl` (I still need to figure out how to verify that).

I have also installed tox plugins that should allow us to reuse the same environment for all three of the actions. The only remaining question is whether tox and GitHub Actions are interoperable when it comes to caching/reusing envs across separate runs. My initial assessment seems to reveal a lot of people struggling with this, and if we can't cache the environment, we may be paying a recurrent cost for rebuilding it over and over.
I've fixed minor typos and tried to make sure that our coverage collection works. AFAICT, to make this work, coverage needs to be run with the `--concurrency=multiprocessing` option, which generates a separate `.coverage-${stuff}` file for each process, followed by `coverage combine --append`, which merges these fragments into a single `.coverage` report.

Right now, I'm kind of assuming that the three operations (ferc_to_sqlite, pudl_etl and integration tests) will be run in sequence and all append into a single coverage file, which will then be converted to an XML report as part of the integration tests.
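A minimal sketch of that flow (the angle-bracketed commands are placeholders rather than the real entry points used in CI; the flags themselves are standard coverage.py options):

```bash
# Each stage runs under coverage with multiprocessing support, so every worker
# process writes its own .coverage.* data fragment.
coverage run --concurrency=multiprocessing <ferc-to-sqlite-command>
coverage run --append --concurrency=multiprocessing <pudl-etl-command>

# Merge the per-process fragments into a single .coverage data file...
coverage combine --append

# ...and convert that into the XML report uploaded from the integration test step.
coverage xml
```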
I will see if the most recent run will show promise.
It still feels like using tox under the hood here just makes things more complicated by adding yet another indirection layer that doesn't really do us much good: we don't benefit much from the tox env management (and we need to specifically force tox to reuse environments across the run), and the dependency between the steps and the coverage files is just a bit painful here.
Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (f4f2470) 92.6% compared to head (c51d8e2) 90.1%. Report is 7 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...rc/pudl/analysis/record_linkage/embed_dataframe.py | 92.1% | 5 Missing :warning: |
| ...rc/pudl/analysis/record_linkage/eia_ferc1_train.py | 0.0% | 2 Missing :warning: |

Additional details and impacted files:

```
@@           Coverage Diff           @@
##            main   #2825     +/-  ##
=======================================
- Coverage   92.6%   90.1%    -2.5%
=======================================
  Files        143     144       +1
  Lines      12979   13084     +105
=======================================
- Hits       12025   11792     -233
- Misses       954    1292     +338
```
@jdangerx it looks like this PR is working now, in that it runs the ETL before the tests and collects coverage from all of that, but it also doesn't seem to have resulted in any speedup whatsoever, even though it's running on the bigger runner. Any ideas as to why that might be? At least looking at the `ferc-to-sqlite` output, it seems like the XBRL extraction is happening in groups of 4 parallel chunks at a time.
It looks like the coverage isn't completely working, since it dropped from 88% to 59%. This is more coverage than we get from just doing the imports (which IIRC is like 30%), so some coverage is being collected, but apparently a bunch is still getting lost by virtue of the way the ETL is getting run.

Hmm, looking at the coverage report for the PR, it looks like huge swaths of the transform functions have lost coverage, so I assume `pudl_etl` isn't actually collecting any coverage. The `ferc_to_sqlite` CLI module also doesn't appear to be getting covered, so I don't think the coverage from either script is working.
Hmph. That is unfortunate. I'm not sure what else to try to make `coverage` interop well with Dagster then :-/
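One standard coverage.py mechanism for measuring subprocesses (not something tried in this thread, just a possibility) is to point child processes at the coverage config via an environment variable and arrange for `coverage.process_startup()` to run in each child, e.g. from a `.pth` file installed in the environment. Whether that plays nicely with Dagster's process-based executors is exactly the open question here.

```bash
# Hypothetical: tell child processes where to find the coverage configuration so
# they can start measurement themselves. This only has an effect if
# coverage.process_startup() is invoked in the child (typically via a .pth file).
export COVERAGE_PROCESS_START=pyproject.toml
```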
> At least looking at the `ferc-to-sqlite` output, it seems like the XBRL extraction is happening in groups of 4 parallel chunks at a time.
If you look at the top of this run, https://github.com/catalyst-cooperative/pudl/actions/runs/6576790075/job/17867156644#step:1:2, you can see the runner name `ubuntu-22.04-4core_674e4143c3a6`, which would imply a 4-core runner.
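(Purely a suggestion, not something done in this PR: the available core count can also be printed from within a workflow step to confirm this.)

```bash
# Prints the number of CPU cores the runner actually exposes.
nproc
```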
I've made some changes, including:
- human-friendly naming of the action & steps
- ditching tox as the intermediary
We have determined that parallelism works as expected (the ETL just got a lot slower), so the only remaining piece is coverage, which still seems to be broken. I will try to dig into that some more to see if there's anything that can be done about it.
I do not know what changed, but it seems that coverage works well now :shrug: so this should be good to go
Somehow it seems like the integration tests are still taking the same amount of time to run. Do we expect to get a speedup from parallelism? Or is that still not enabled here?
@rousik It doesn't seem like the coverage results are showing up in the checks anymore.
Hmm. It looks like there's some kind of error in processing the coverage results that were uploaded.
https://app.codecov.io/gh/catalyst-cooperative/pudl/pull/2825
Hmm, so the coverage is sadly not working 😢
I think I'm going to lose my mind here. I thought I had finally figured out the right configuration, only to find out that it still seems to be busted, so back down the rabbit hole of investigation 🤦
@rousik I merged `dev` in to get the `pre-commit` version of the lockfile updates, which should involve much less churn.
It seems like it's mostly getting the coverage, with 90.1%. Are the tests themselves not being covered now, maybe? It looks like the `test/unit` and `test/integration` lines in the coverage config section of `pyproject.toml` somehow got lost in a merge. I added them back in.
@rousik Can you update the main description for this PR to reflect the current situation? It's been through so many changes and the code on `main` has changed so much that I'm no longer clear what the net effect is.

The integration test run time seems to be almost identical to the current tests (~1.5 hours).

Is the primary change that this PR is running the ETL outside the context of `pytest`? Where were we going with that? Is the next step to kick off the FERC to SQLite conversions as their own jobs on smaller runners which can run in parallel rather than serially, and have the main PUDL ETL start on a larger runner as soon as the FERC 1 DBs are available, hopefully shortening the overall integration test runtime and also reducing the running time for the paid runner?
> Is the next step to kick off the FERC to SQLite conversions as their own jobs on smaller runners which can run in parallel rather than serially, and have the main PUDL ETL start on a larger runner as soon as the FERC 1 DBs are available?
I suppose I need to rerun some of these tests again, because we should be getting speed-ups from this even if nothing more is done. Isolating `ferc_to_sqlite`, `pudl_etl` and the integration tests in separate steps should make it a little easier to diagnose which parts failed, and should also (as pointed out) allow us to further speed up/improve the individual steps as needed. We could beef up the `pudl_etl` part, hopefully getting faster there, and then shard `ferc_to_sqlite`, which should be a good time improvement.
> Can you update the main description for this PR to reflect the current situation?
I have re-read the contents of this PR and its description, and I think they are still in sync. This refactors ci-integration into smaller individual tasks.