Integrate splink model for FERC to EIA record linkage
Overview
Integrate the `splink` model into PUDL to conduct the FERC to EIA entity matching.
Closes #32 and #10, but more generally this is one of the last lingering things in that CCAI repo.
What problem does this address?
After merging this we should be able to directly compare the output of the existing PUDL model, which uses logistic regression to perform FERC1 to EIA entity matching, with this `splink` model. Then we can choose which one to primarily use (I think probably `splink`) based on the metrics we want to prioritize. I also have a call scheduled with RMI to get their input on comparing matches and which metrics to prioritize.
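To make that comparison concrete, the two models' outputs can be joined on the FERC record ID and bucketed into agree / disagree / covered-by-only-one. The column names and values below are illustrative, not the actual PUDL asset schemas:

```python
import pandas as pd

# Hypothetical match outputs: each model maps a FERC1 record to an EIA record.
pudl_matches = pd.DataFrame({
    "record_id_ferc1": ["f1", "f2", "f3", "f4"],
    "record_id_eia": ["e1", "e2", "e3", "e4"],
})
splink_matches = pd.DataFrame({
    "record_id_ferc1": ["f1", "f2", "f3", "f5"],
    "record_id_eia": ["e1", "e9", "e3", "e5"],
})

# An outer merge with an indicator shows where the two models agree,
# disagree, or cover different FERC records entirely.
comparison = pudl_matches.merge(
    splink_matches,
    on="record_id_ferc1",
    how="outer",
    suffixes=("_pudl", "_splink"),
    indicator=True,
)
agree = comparison.query(
    "_merge == 'both' and record_id_eia_pudl == record_id_eia_splink"
)
print(f"{len(agree)} of {len(comparison)} FERC records matched identically")
```

Rows where `_merge` is `left_only` or `right_only` are the coverage differences between the two models.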
This comment gives a first look at how the splink model compares to the PUDL results.
Is it okay to temporarily have both match results in the DB?
What did you change?
Integrated a new module with the `splink` model for entity matching.
Testing
How did you make sure this worked? How can a reviewer verify this?
Pros and Cons of Splink
Pros
- Model performance is less dependent on the quality of the training data
- Active discussion and development in the splink repo. A lot of new features seem to be going in.
- Lots of helpful visualizations and tools to make sure the model is doing what you think it should be doing and parameters are being weighted in an intuitive way
- The results from conducting an unsupervised match look similar
- Seems like they're headed in the direction of using labeling functions to create more training data that iteratively improves model results.
- We don't rely on the manually created and slightly fishy `plant_id_pudl` or `utility_id_pudl` for blocking
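The point of blocking is that only record pairs sharing a block key ever get compared, so a cheap derived key can replace the hand-curated IDs. A toy pandas illustration (the tables and key here are made up, not the real PUDL inputs or the rules this PR uses):

```python
import pandas as pd

# Toy FERC and EIA plant records; names and columns are for illustration.
ferc = pd.DataFrame({
    "record_id": ["f1", "f2", "f3"],
    "plant_name": ["barry", "big bend", "comanche"],
    "state": ["AL", "FL", "CO"],
})
eia = pd.DataFrame({
    "record_id": ["e1", "e2", "e3", "e4"],
    "plant_name": ["barry", "bowen", "comanche", "big bend"],
    "state": ["AL", "GA", "CO", "FL"],
})

# Without blocking, every FERC record is compared to every EIA record.
all_pairs = ferc.merge(eia, how="cross", suffixes=("_ferc", "_eia"))

# Blocking on a cheap derived key (first letter of the name plus state)
# only generates candidate pairs that share the key -- no plant_id_pudl
# or utility_id_pudl required.
for df in (ferc, eia):
    df["block_key"] = df["plant_name"].str[0] + "_" + df["state"]
candidates = ferc.merge(eia, on="block_key", suffixes=("_ferc", "_eia"))

print(f"{len(all_pairs)} pairs without blocking, {len(candidates)} with")
```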
Cons
- If development stopped on `splink`, then the structure is less transferable to a more traditional linear model built with `sklearn`.
- There's maybe some weirdness with dependencies, but that's maybe always sort of the case when we add cool new dependencies to PUDL.
# To-do list
- [x] Figure out a fix for the way `poetry` handles the `sqlalchemy` dependency in `splink` that makes it incompatible with `pudl` (see above discussion).
- [x] Add in a `devtools` notebook that can output all the helpful `splink` visualizations for this match as well as the unsupervised matching model
- [x] Track precision and recall
- [x] Improve performance of blocking rules - try soundex/dmetaphone, try faiss
- [x] Move metrics to experiment tracking infra
- [x] Do a comparison of all the metrics in `prettyify_best_matches` between the splink and existing PUDL model.
- [ ] Go through remaining TODO comments
- [ ] Actually persist matching results to DB once we're done tinkering with it
- [ ] Remove old model and update asset names
- [ ] Add more testing
- [ ] Make sure full ETL runs & `make pytest-integration-full` passes locally
- [ ] For major data coverage & analysis changes, [run data validation tests](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/testing.html#data-validation)
- [ ] If updating analyses or data processing functions: make sure to update or write data validation tests
- [ ] Update the [release notes](../docs/release_notes.rst): reference the PR and related issues.
- [x] Review the PR yourself and call out any questions or issues you have
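On the "try soundex/dmetaphone" blocking item above: phonetic keys put similar-sounding names in the same block even when spellings differ. A simplified American Soundex, just to illustrate the idea (a real run would use an existing implementation, e.g. one shipped by splink or a library like jellyfish):

```python
# Letter-to-digit table for American Soundex.
CODES = {}
for digit, letters in [
    ("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
    ("4", "L"), ("5", "MN"), ("6", "R"),
]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name: str) -> str:
    """Encode a name as its 4-character Soundex key (simplified)."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    encoded = []
    prev = CODES.get(name[0])
    for ch in name[1:]:
        if ch in "HW":      # H and W don't separate duplicate codes
            continue
        code = CODES.get(ch)
        if code is None:    # vowels reset the "previous code" state
            prev = None
            continue
        if code != prev:
            encoded.append(code)
        prev = code
    return (name[0] + "".join(encoded))[:4].ljust(4, "0")
```

Blocking on `soundex(plant_name)` would, for example, put "Robert" and "Rupert" in the same block, since both encode to `R163`.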
I think the "right" solution here (if we can get it implemented) is probably for `splink` to remove the `sqlalchemy<2` constraint. If it's really only there to prevent issues with `pandas<2`, that should probably be the responsibility of users who are depending on `splink` -- if `splink` allows any version of `sqlalchemy`, then users who need to use `pandas<2` for some reason can pin their own dependencies. I don't see any issues related to this in the `splink` repo, so maybe we should create one to discuss and see if there are other blocking issues for them. Given that SQLAlchemy 2.x has been out for a year, and pandas 2.0 has been out for 9 months, I don't think this would be an unreasonable request. Pandas 3.0 is slated to come out this spring.
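For a downstream user that fix is cheap: if `splink` dropped its pin, a project that still needs the old pandas would just pin its own dependencies. A hypothetical downstream `pyproject.toml` fragment (illustrative only, not PUDL's or splink's actual metadata):

```toml
# Hypothetical downstream project stuck on pandas 1.x: pin pandas and
# sqlalchemy yourself rather than relying on splink's constraint.
[project]
dependencies = [
    "splink",
    "pandas>=1.5,<2",
    "sqlalchemy>=1.4,<2",
]
```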
As far as where the dependency restriction we're specifically running into is coming from: if we're installing from `conda-forge`, it'll be in the recipe for the conda package:
```yaml
requirements:
  host:
    - python >=3.8,<4.0
    - pip
    - poetry >=0.12
  run:
    - python >=3.8,<4.0
    - jsonschema >=3.2,<5.0
    - python-duckdb >=0.8.0
    - pandas >=1.2.0
    - sqlglot >=7.0.0
    - altair >=5.0.1,<6.0
    - jinja2 >=3.0.3,<4.0.0
  run_constrained:
    - pyspark >=3.2.1
    - sqlalchemy >=1.4.0,<2.0.0
```
I had not previously encountered `run_constrained`, but here's the documentation from conda. The conda recipe dependencies are typically manually compiled, so this is probably an imperfect translation of the poetry dependency specs. Also, I think one could probably implement the less restrictive dependencies just by changing the conda packaging, but that's kinda hacky. And who knows, maybe there are some other real issues with SQLAlchemy 2.
Just for testing, you might try manually installing splink and its dependencies into the `pudl-dev` environment with `pip` and see if everything works, or if it runs into real problems. Conversely, you might try relaxing the `sqlalchemy` requirement in PUDL. IIRC we're not compatible with v1.4.x, but I could be remembering that wrong.
> I think the "right" solution here (if we can get it implemented) is probably for `splink` to remove the `sqlalchemy<2` constraint.
Ya I think that's probably best as well. I didn't notice that they're not on the latest pandas version. I'll make an issue, they're usually pretty responsive.
> Just for testing, you might try manually installing splink and its dependencies into the `pudl-dev` environment with `pip` and see if everything works, or if it runs into real problems.
Yep, that's what I've been doing for now.
See also this issue in the splink repo: https://github.com/moj-analytical-services/splink/issues/1906
@zschira there's still a lengthy tasklist for this PR but I think you can give it a first look now and familiarize.
Both the `splink` model and PUDL model have precision scores (a measure of accuracy when the model makes a prediction) around .94 and recall scores (a measure of coverage of FERC records) of .9. The `splink` model has better coverage of the steam plants table in particular, which is something that RMI is interested in optimizing for. The matches from the PUDL model are slightly more consistent with `plant_id_ferc1` across time (.75 vs .73).
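For reference, precision and recall here reduce to simple set arithmetic over matched pairs. A toy illustration with made-up pairs (not the real FERC-EIA validation data or the quoted numbers):

```python
# True matches are hand-labeled (FERC record, EIA record) pairs;
# predicted matches come from one of the models. All values are made up.
true_matches = {("f1", "e1"), ("f2", "e2"), ("f3", "e3"), ("f4", "e4")}
predicted = {("f1", "e1"), ("f2", "e2"), ("f3", "e9")}

tp = len(true_matches & predicted)   # correctly predicted pairs
precision = tp / len(predicted)      # accuracy of the predictions made
recall = tp / len(true_matches)      # coverage of the true FERC matches

print(f"precision={precision:.2f} recall={recall:.2f}")
```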
The `splink` model doesn't use the manually generated `plant_id_pudl` or `utility_id_pudl` to block on, so it's good that it's getting comparable metrics, because we don't want to keep relying on those columns.
@zschira I know this is a mega big PR so I think I left a comment everywhere that I had questions. Besides that, I think what would be most helpful is if you tried to create the environment and run the splink model, to make sure everything works for you. The idea is that this PR will fully replace the linear regression model with this splink model since results are comparable and this model is more adaptable and efficient. When this PR is ready to go, I'll take out the old model and rename the splink assets to be the old asset name.
The bottleneck right now is actually the name cleaning. There's probably a way to speed this up?
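One common speedup for this kind of string cleaning is to precompile the regexes and clean each unique name once, mapping the result back, since company and plant names repeat heavily across rows. A sketch with hypothetical cleaning rules (the real PUDL name cleaning is more involved):

```python
import re

# Hypothetical cleaning rules -- illustrative, not PUDL's actual pipeline.
PUNCT = re.compile(r"[^\w\s]")
SUFFIX = re.compile(r"\b(llc|inc|corp|co)\b")
SPACES = re.compile(r"\s+")

def clean_name(name: str) -> str:
    """Lowercase, strip punctuation and common corporate suffixes."""
    name = PUNCT.sub(" ", name.lower())
    name = SUFFIX.sub(" ", name)
    return SPACES.sub(" ", name).strip()

# Clean each unique value once, then map the results back onto the column:
#   unique = {n: clean_name(n) for n in df["utility_name"].unique()}
#   df["utility_name_clean"] = df["utility_name"].map(unique)
print(clean_name("Duke Energy Corp."))
```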
Also, the devtools notebook that I added is the best way I could come up with to show a bunch of visuals about the model. I'm not sure what the best way to maintain this notebook is. Or if the visualizations could somehow be integrated into the experiment tracking.