
Run ferc_to_sqlite and pudl_etl independent of integration tests

rousik opened this issue 1 year ago • 24 comments

This PR reworks how the unit and integration tests are run by GitHub Actions.

In particular, it changes how ci-integration is executed by running the ferc_to_sqlite and pudl_etl commands independently, making use of dagster parallelism and larger runners.

rousik avatar Sep 03 '23 16:09 rousik

I think the call to pytest is failing because you need to install some or all of the extra dependencies when you install the catalystcoop.pudl package -- right now you're just installing the bare minimum requirements. So...

pip install ./[doc,test,dev,datasette]

rather than just

pip install .

zaneselvans avatar Sep 06 '23 17:09 zaneselvans


Now the tests are failing because they're looking for the databases in the wrong location -- in the temporary directory that they would have been written to if they'd been built with the fixtures, rather than the directories specified by $PUDL_INPUT and $PUDL_OUTPUT. Those envvars are being passed into the Tox environment though, so it seems as if there may be a problem in how the correct output directory is identified by the settings.
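One quick way to rule out the simplest failure mode is to confirm, inside the CI step itself, that the variables are actually set before tox is invoked (the variable names come from the comment above; this check is an editorial sketch, not something from the PR):

```shell
# Print the PUDL data directories exactly as this step's shell sees them.
# printenv exits non-zero if any named variable is unset, which fails the
# step early instead of letting the tests look in the wrong place.
printenv PUDL_INPUT PUDL_OUTPUT
```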

zaneselvans avatar Sep 07 '23 00:09 zaneselvans

A separate issue here is that we're breaking the python environment isolation that Tox provides by running pudl_etl outside of it and installing from the local directory, rather than from the package that Tox builds, in the virtual environment that Tox creates, which means it might have access to files that appear in the repository that don't appear in the package. Instead we could run the ETL inside Tox like is done with tox -e nuke.

zaneselvans avatar Sep 07 '23 00:09 zaneselvans

Hmm. I tried running the following locally:

tox -e integration -- --live-dbs

And it didn't have any trouble finding the correct pre-existing DBs in my $PUDL_OUTPUT directory.

zaneselvans avatar Sep 07 '23 01:09 zaneselvans


I'm still contemplating what the best approach here would be. I would like to reuse tox, but I would also like to be able to run the three steps (ferc_to_sqlite, pudl_etl, integration tests) independently, as that AFAIK gives better timing insight (how long each stage took) and faster root-cause analysis from the logs (i.e. you know whether the ETL or ferc_to_sqlite failed, without having that intermixed with the integration test output).

I suppose the missing bit here is telling tox that all three of these operations should run in the same tox environment, and that we'd like to cache this environment between GitHub Actions steps. Let me investigate some more.

rousik avatar Sep 11 '23 12:09 rousik

I have switched this back to using tox to run the commands, and I believe I've made it so that coverage is exported for both ferc_to_sqlite and pudl_etl (I still need to figure out how to verify that).

I have also installed tox plugins that should allow us to reuse the same environment for all three actions. The only remaining question is whether tox and GitHub Actions interoperate well enough to cache/reuse environments across separate runs. My initial assessment is that a lot of people struggle with this, and if we can't cache the environment, we may be paying a recurring cost to rebuild the same env over and over.

rousik avatar Sep 11 '23 17:09 rousik

I've fixed minor typos and tried to make sure that our coverage collection works. AFAICT, to make this work, coverage needs to be run with the --concurrency=multiprocessing option, which generates a separate .coverage.* fragment file for each process, followed by coverage combine --append, which merges these fragments into a single .coverage data file.

Right now, I'm assuming that the three operations (ferc_to_sqlite, pudl_etl, and the integration tests) will run in sequence and all append into a single coverage file, which will then be converted to an XML report as part of the integration tests.

I will see if the most recent run will show promise.

It still feels like using tox under the hood here just makes things more complicated by adding another indirection layer that doesn't do us much good: we don't really benefit from tox's env management (and we have to specifically force tox to reuse environments across the run), and the dependency between the steps and the coverage files is just a bit painful.

rousik avatar Sep 25 '23 21:09 rousik

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (f4f2470) 92.6% compared to head (c51d8e2) 90.1%. Report is 7 commits behind head on main.

Files Patch % Lines
...rc/pudl/analysis/record_linkage/embed_dataframe.py 92.1% 5 Missing :warning:
...rc/pudl/analysis/record_linkage/eia_ferc1_train.py 0.0% 2 Missing :warning:
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #2825     +/-   ##
=======================================
- Coverage   92.6%   90.1%   -2.5%     
=======================================
  Files        143     144      +1     
  Lines      12979   13084    +105     
=======================================
- Hits       12025   11792    -233     
- Misses       954    1292    +338     


codecov[bot] avatar Oct 03 '23 20:10 codecov[bot]

@jdangerx it looks like this PR is working now, in that it runs the ETL before the tests and collects coverage from all of that, but it also doesn't seem to have resulted in any speedup whatsoever, even though it's running on the bigger runner. Any ideas as to why that might be? At least looking at the ferc-to-sqlite output, it seems like the XBRL extraction is happening in groups of 4 parallel chunks at a time.

zaneselvans avatar Oct 19 '23 15:10 zaneselvans

It looks like the coverage isn't completely working since it dropped from 88% to 59%. This is more coverage than we get from just doing the imports (which IIRC is like 30%) so some coverage is being collected, but apparently a bunch is still getting lost by virtue of the way the ETL is getting run.

Hmm, looking at the coverage report for the PR it looks like huge swaths of the transform functions have lost coverage, so I assume the pudl_etl isn't actually collecting any coverage. The ferc_to_sqlite CLI module also doesn't appear to be getting covered, so I don't think the coverage from either script is working.

zaneselvans avatar Oct 19 '23 16:10 zaneselvans

It looks like the coverage isn't completely working since it dropped from 88% to 59%. This is more coverage than we get from just doing the imports (which IIRC is like 30%) so some coverage is being collected, but apparently a bunch is still getting lost by virtue of the way the ETL is getting run.

Hmm, looking at the coverage report for the PR it looks like huge swaths of the transform functions have lost coverage, so I assume the pudl_etl isn't actually collecting any coverage. The ferc_to_sqlite CLI module also doesn't appear to be getting covered, so I don't think the coverage from either script is working.

Hmph. That is unfortunate. I'm not sure what else to try to make coverage interop well with dagster then :-/

rousik avatar Oct 20 '23 18:10 rousik

@jdangerx it looks like this PR is working now, in that it runs the ETL before the tests and collects coverage from all of that, but it also doesn't seem to have resulted in any speedup whatsoever, even though it's running on the bigger runner. Any ideas as to why that might be? At least looking at the ferc-to-sqlite output, it seems like the XBRL extraction is happening in groups of 4 parallel chunks at a time.

If you look at the top of this run https://github.com/catalyst-cooperative/pudl/actions/runs/6576790075/job/17867156644#step:1:2, you can see the runner name ubuntu-22.04-4core_674e4143c3a6, which would imply a 4-core runner.
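For what it's worth, the core count a runner actually exposes can also be confirmed directly in a step, independent of the runner's name (a generic check, not something from this PR):

```shell
# Both of these should agree on the number of usable cores; that figure is
# the ceiling for how many processes can genuinely run in parallel.
nproc
python3 -c "import os; print(os.cpu_count())"
```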

rousik avatar Oct 20 '23 18:10 rousik

I've made some changes, including:

  • human-friendly naming of the action & steps
  • ditching tox as the intermediary

rousik avatar Oct 25 '23 07:10 rousik

We have determined that parallelism works as expected (the ETL just got a lot slower), so the only remaining piece is coverage, which still seems to be broken. I will try to dig into that some more to see if there's anything that can be done about it.

rousik avatar Nov 08 '23 03:11 rousik

I do not know what changed, but it seems that coverage works well now :shrug: so this should be good to go

rousik avatar Nov 28 '23 03:11 rousik

Somehow it seems like the integration tests are still taking the same amount of time to run. Do we expect to get a speedup from parallelism? Or is that still not enabled here?

zaneselvans avatar Nov 28 '23 19:11 zaneselvans

@rousik It doesn't seem like the coverage results are showing up in the checks anymore.

Hmm. It looks like there's some kind of error in processing the coverage results that were uploaded.

https://app.codecov.io/gh/catalyst-cooperative/pudl/pull/2825

zaneselvans avatar Nov 29 '23 06:11 zaneselvans

Hmm, so the coverage is sadly not working 😢

zaneselvans avatar Nov 29 '23 20:11 zaneselvans

I think I'm going to lose my mind here. I thought I had finally figured out the right configuration, only to find that it still seems to be busted, so back down the rabbit hole of investigation 🤦

rousik avatar Dec 13 '23 21:12 rousik

@rousik I merged dev in to get the pre-commit version of the lockfile updates, which should involve much less churn.

It seems like it's mostly getting the coverage, at 90.1%. Are the tests themselves not being covered now, maybe? It looks like the test/unit and test/integration lines in the coverage config section of pyproject.toml somehow got lost in the merge. I added them back in.

zaneselvans avatar Dec 15 '23 04:12 zaneselvans

@rousik Can you update the main description for this PR to reflect the current situation? It's been through so many changes and the code on main has changed so much that I'm no longer clear what the net effect is.

The integration test run time seems to be almost identical to the current tests (~1.5 hours).

Is the primary change that this PR is running the ETL outside the context of pytest? Where were we going with that? Is the next step to kick off the FERC to SQLite conversions as their own jobs on smaller runners which can run in parallel rather than serially, and have the main PUDL ETL start on a larger runner as soon as the FERC 1 DBs are available, hopefully shortening the overall integration test runtime and also reducing the running time for the paid runner?

zaneselvans avatar Jan 17 '24 20:01 zaneselvans

@rousik Can you update the main description for this PR to reflect the current situation? It's been through so many changes and the code on main has changed so much that I'm no longer clear what the net effect is.

The integration test run time seems to be almost identical to the current tests (~1.5 hours).

Is the primary change that this PR is running the ETL outside the context of pytest? Where were we going with that? Is the next step to kick off the FERC to SQLite conversions as their own jobs on smaller runners which can run in parallel rather than serially, and have the main PUDL ETL start on a larger runner as soon as the FERC 1 DBs are available, hopefully shortening the overall integration test runtime and also reducing the running time for the paid runner?

I suppose I need to rerun some of these tests, because we should be getting speed-ups from this even if nothing more is done. Isolating ferc_to_sqlite, pudl_etl, and the integration tests into separate steps should make it a little easier to diagnose which part failed, and should also (as pointed out) let us further speed up/improve the individual steps as needed. We could beef up the pudl_etl part, hopefully getting faster there, and then shard ferc_to_sqlite, which should be a good time improvement.

rousik avatar Jan 18 '24 05:01 rousik

@rousik Can you update the main description for this PR to reflect the current situation? It's been through so many changes and the code on main has changed so much that I'm no longer clear what the net effect is.

The integration test run time seems to be almost identical to the current tests (~1.5 hours).

Is the primary change that this PR is running the ETL outside the context of pytest? Where were we going with that? Is the next step to kick off the FERC to SQLite conversions as their own jobs on smaller runners which can run in parallel rather than serially, and have the main PUDL ETL start on a larger runner as soon as the FERC 1 DBs are available, hopefully shortening the overall integration test runtime and also reducing the running time for the paid runner?

I have re-read the contents of this PR and its description, and I think they are still in sync. This refactors ci-integration into smaller individual tasks.

rousik avatar Jan 18 '24 05:01 rousik