
Integrate splink model for FERC to EIA record linkage

Open katie-lamb opened this issue 1 year ago • 7 comments

Overview

Integrate the splink model into PUDL to conduct the FERC to EIA entity matching.

Closes 32 and 10, and more generally wraps up one of the last lingering pieces of work in that CCAI repo.

What problem does this address?

After merging this, we should be able to directly compare the output of the existing PUDL model (which uses logistic regression to perform FERC1 to EIA entity matching) against this splink model, and then choose which one to primarily use (probably splink, I think) based on the metrics we want to prioritize. I also have a call scheduled with RMI to get their input on comparing matches and which metrics to prioritize.

This comment gives a first look at how the splink model compares to the PUDL results.

Is it okay to temporarily have both match results in the DB?

What did you change?

Integrated a new module with the splink model for entity matching.

Testing

How did you make sure this worked? How can a reviewer verify this?

Pros and Cons of Splink

Pros

  • Model performance is less dependent on the quality of the training data
  • Active discussion and development in the splink repo; a lot of new features seem to be landing.
  • Lots of helpful visualizations and tools to make sure the model is doing what you think it should be doing and parameters are being weighted in an intuitive way
  • The results from conducting an unsupervised match look similar to the supervised results
  • Seems like they're headed in the direction of using labeling functions to create more training data that iteratively improves model results.
  • We don't rely on the manually created and slightly fishy plant_id_pudl or utility_id_pudl for blocking

Cons

  • If development stopped on splink, the structure is less transferable to a more traditional linear model built with sklearn.
  • There's some weirdness with dependencies, but that's often the case when we add cool new dependencies to PUDL.

To-do list

- [x] Figure out a fix for the way `poetry` handles the `sqlalchemy` dependency in `splink` that makes it incompatible with `pudl` (see above discussion).
- [x] Add in a `devtools` notebook that can output all the helpful `splink` visualizations for this match as well as the unsupervised matching model
- [x] Track precision and recall
- [x] Improve performance of blocking rules - try soundex/dmetaphone, try faiss
- [x] Move metrics to experiment tracking infra
- [x] Do a comparison of all the metrics in `prettyify_best_matches` between the splink and existing PUDL model.
- [ ] Go through remaining TODO comments
- [ ] Actually persist matching results to DB once we're done tinkering with it
- [ ] Remove old model and update asset names
- [ ] Add more testing
- [ ] Make sure full ETL runs & `make pytest-integration-full` passes locally
- [ ] For major data coverage & analysis changes, [run data validation tests](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/testing.html#data-validation)
- [ ] If updating analyses or data processing functions: make sure to update or write data validation tests
- [ ] Update the [release notes](../docs/release_notes.rst): reference the PR and related issues.
- [x] Review the PR yourself and call out any questions or issues you have
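
The blocking experiment in the to-do list above (try soundex/dmetaphone) can be sketched with a minimal, stdlib-only Soundex implementation. This is a hypothetical illustration of phonetic blocking on plant names with toy records, not the actual code in this PR:

```python
from collections import defaultdict

def soundex(name: str) -> str:
    """Return the 4-character American Soundex code for a name."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    letters = "".join(c for c in name.lower() if c.isalpha())
    if not letters:
        return "0000"
    out = [letters[0].upper()]
    prev = codes.get(letters[0], "")
    for c in letters[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            out.append(code)
        if c not in "hw":  # h and w don't break a run of same-coded consonants
            prev = code
    return "".join(out)[:4].ljust(4, "0")

# Block EIA records by the Soundex code of the plant name, then only compare
# each FERC record against EIA records in the same block (toy data, made-up shape).
eia = [{"eia_id": 1, "plant_name": "Comanche"},
       {"eia_id": 2, "plant_name": "Cherokee"}]
ferc = [{"ferc_id": "a", "plant_name": "Commanche"}]  # misspelled in FERC

blocks = defaultdict(list)
for rec in eia:
    blocks[soundex(rec["plant_name"])].append(rec)

candidate_pairs = [(f["ferc_id"], e["eia_id"])
                   for f in ferc
                   for e in blocks[soundex(f["plant_name"])]]
# "Commanche" and "Comanche" share a Soundex code, so the misspelled FERC
# record still lands in the right block without any exact-name match.
```

The upside over blocking on `plant_id_pudl` is that the key is derived from the name itself; the tradeoff is that phonetically similar but unrelated plants end up in the same block, which the pairwise comparison stage then has to reject.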

katie-lamb avatar Jan 28 '24 18:01 katie-lamb

I think the "right" solution here (if we can get it implemented) is probably for splink to remove the sqlalchemy<2 constraint. If it's really only there to prevent issues with pandas<2 that should probably be the responsibility of users who are depending on splink -- if splink allows any version of sqlalchemy then users who need to use pandas<2 for some reason can pin their own dependencies. I don't see any issues related to this in the splink repo, so maybe we should create one to discuss and see if there are other blocking issues for them. Given that SQLAlchemy 2.x has been out for a year, and pandas 2.0 has been out for 9 months, I don't think this would be an unreasonable request. Pandas 3.0 is slated to come out this spring.
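
For concreteness, the upstream change being proposed would amount to something like this in splink's `pyproject.toml` (a hypothetical sketch of the relaxed constraint, not their actual file):

```toml
[tool.poetry.dependencies]
# was: sqlalchemy = ">=1.4.0,<2.0.0"
sqlalchemy = ">=1.4.0"
```

Users who need to stay on pandas<2 could then add their own `sqlalchemy<2` pin in their project, rather than having the ceiling imposed on everyone downstream.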

As far as where the dependency restriction we're specifically running into is coming from, if we're installing from conda-forge, it'll be in the recipe for the conda package:

requirements:
  host:
    - python >=3.8,<4.0
    - pip
    - poetry >=0.12
  run:
    - python >=3.8,<4.0
    - jsonschema >=3.2,<5.0
    - python-duckdb >=0.8.0
    - pandas >=1.2.0
    - sqlglot >=7.0.0
    - altair >=5.0.1,<6.0
    - jinja2 >=3.0.3,<4.0.0
  run_constrained:
    - pyspark >=3.2.1
    - sqlalchemy >=1.4.0,<2.0.0

I had not previously encountered `run_constrained`, but here's the documentation from conda. The conda recipe dependencies are typically compiled manually, so this is probably an imperfect translation of the poetry dependency specs. I also think one could probably implement the less restrictive dependencies just by changing the conda packaging, but that's kinda hacky. And who knows, maybe there are some other real issues with sqlalchemy 2.

Just for testing, you might try manually installing splink and its dependencies into the pudl-dev environment with pip and see if everything works, or if it runs into real problems. Conversely, you might try relaxing the sqlalchemy requirement in PUDL. IIRC we're not compatible with v1.4.x but I could be remembering that wrong.

zaneselvans avatar Jan 29 '24 15:01 zaneselvans

> I think the "right" solution here (if we can get it implemented) is probably for splink to remove the sqlalchemy<2 constraint.

Ya I think that's probably best as well. I didn't notice that they're not on the latest pandas version. I'll make an issue, they're usually pretty responsive.

> Just for testing, you might try manually installing splink and its dependencies into the pudl-dev environment with pip and see if everything works, or if it runs into real problems.

Yep, that's what I've been doing for now.

katie-lamb avatar Jan 29 '24 17:01 katie-lamb

See also this issue in the splink repo: https://github.com/moj-analytical-services/splink/issues/1906

zaneselvans avatar Jan 29 '24 20:01 zaneselvans

@zschira there's still a lengthy tasklist for this PR, but I think you can give it a first look now and familiarize yourself with it.

katie-lamb avatar Feb 02 '24 00:02 katie-lamb


Both the splink model and the PUDL model have precision scores (a measure of accuracy when the model makes a prediction) around 0.94 and recall scores (a measure of coverage of FERC records) of 0.9. The splink model has better coverage of the steam plants table in particular, which is something RMI is interested in optimizing for. The matches from the PUDL model are slightly more consistent with `plant_id_ferc1` across time (0.75 vs. 0.73).

The splink model doesn't block on the manually generated `plant_id_pudl` or `utility_id_pudl`, so it's good that it's getting comparable metrics, because we don't want to keep relying on those columns.
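
For reference, the two metrics quoted above boil down to set arithmetic on predicted vs. labeled match pairs. A minimal sketch with toy pairs (not real PUDL data):

```python
# Ground-truth and predicted (ferc_id, eia_id) match pairs -- toy data.
truth = {("f1", "e1"), ("f2", "e2"), ("f3", "e3"), ("f4", "e4")}
predicted = {("f1", "e1"), ("f2", "e2"), ("f3", "e9")}

true_positives = len(predicted & truth)

# Precision: of the matches the model asserted, how many are right.
precision = true_positives / len(predicted)

# Recall: of the labeled FERC records, how many the model matched correctly.
recall = true_positives / len(truth)
```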

katie-lamb avatar Feb 27 '24 02:02 katie-lamb

@zschira I know this is a mega big PR, so I think I left a comment everywhere I had questions. Beyond that, I think what would be most helpful is if you tried to create the environment and run the splink model, to make sure everything works for you. The idea is that this PR will fully replace the logistic regression model with this splink model, since the results are comparable and this model is more adaptable and efficient. When this PR is ready to go, I'll take out the old model and rename the splink assets to the old asset names.

The bottleneck right now is actually the name cleaning; there's probably a way to speed this up.
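
On the name-cleaning bottleneck: one common speedup (a generic sketch, not the PR's actual cleaning code) is to precompile the regexes once and apply them in a single pass per name, rather than chaining many per-row string operations:

```python
import re

# Hypothetical cleaning rules: lowercase, drop common corporate suffixes,
# strip punctuation, collapse whitespace. Compiling the patterns once at
# module level avoids re-parsing them on every row.
_SUFFIX = re.compile(r"\b(?:llc|inc|corp|co)\b\.?", flags=re.IGNORECASE)
_PUNCT = re.compile(r"[^\w\s]")
_WS = re.compile(r"\s+")

def clean_name(name: str) -> str:
    name = _SUFFIX.sub("", name.lower())
    name = _PUNCT.sub(" ", name)
    return _WS.sub(" ", name).strip()
```

With pandas, the same precompiled patterns can be applied column-wise via `Series.str.replace(pattern, repl, regex=True)`, which keeps the cleaning logic in one place and avoids a Python-level `apply` over every row.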

Also, the devtools notebook that I added is the best way I could come up with to show a bunch of visuals about the model. I'm not sure what the best way to maintain this notebook is. Or if the visualizations could somehow be integrated into the experiment tracking.

katie-lamb avatar Feb 27 '24 06:02 katie-lamb