
cluster_pairwise_predictions_at_threshold is crashing on Databricks in v4.0.11

Open Brice543 opened this issue 3 weeks ago • 5 comments

What happens?

Since the new version, the function linker.clustering.cluster_pairwise_predictions_at_threshold() crashes at the end of the process due to missing tables. It reports that the table `filtered_neighbours` does not exist in Databricks.

I checked, and it seems to be because the drop command is executed twice in connected_component.py. Running `DROP TABLE IF EXISTS` instead would guard against this behaviour.
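For illustration, the suggested guard in Spark SQL terms (a sketch of the idea, not the actual splink code; the table name is taken from the error):

```python
# Sketch of the suggested guard: with IF EXISTS, a second drop of the
# same table is a no-op instead of an error. Assumes a Databricks
# notebook where `spark` is already defined.
spark.sql("DROP TABLE IF EXISTS filtered_neighbours")
```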

[Screenshot of the error]

To Reproduce

  1. Install the latest version of the library on Databricks.
  2. Run a standard pipeline up to the clustering step.
  3. Run linker.clustering.cluster_pairwise_predictions_at_threshold() ==> it crashes due to missing tables (see the sketch below).
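For reference, the failing step looks roughly like this (a sketch; `df_predict` and the threshold value are assumptions, not taken from the report):

```python
# Sketch of the reproducing step: df_predict is the output of the
# prediction step; the threshold value is illustrative.
df_predict = linker.inference.predict()
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)
```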

OS:

Databricks DBR

Splink version:

4.0.11

Have you tried this on the latest master branch?

  • [x] I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • [x] I agree

Brice543 avatar Nov 28 '25 06:11 Brice543

I'm currently evaluating whether to use this library for a production use case and am getting the same error. Running the tutorial code seems like a basic test that should be done before release. As a consequence, I don't think we can use this library. Are you open to contributors submitting PRs?

[Screenshot of the error]

shaunryan avatar Nov 29 '25 11:11 shaunryan

Hello shaunryan,

Just saw your message. To solve your issue, try defining your default catalog and schema at the beginning of your notebook with `%sql USE CATALOG xxxxx; USE SCHEMA xxxxx`.
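In a Python cell, the equivalent would be (a minimal sketch; the catalog and schema names are placeholders to fill in):

```python
# Set the default catalog and schema so splink's intermediate tables
# are created and looked up in the same namespace (names are placeholders).
spark.sql("USE CATALOG xxxxx")
spark.sql("USE SCHEMA xxxxx")
```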

Otherwise, just pin the previous version of the library; that works.
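For example, in a Databricks notebook (assuming 4.0.10 is the release immediately before the affected 4.0.11; the exact version to pin is an assumption):

```
# Pin the prior release; 4.0.10 is an assumption based on the
# affected version being 4.0.11.
%pip install splink==4.0.10
```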

Brice543 avatar Nov 29 '25 11:11 Brice543

I think this is fixed, it just hasn't been released yet: https://github.com/moj-analytical-services/splink/pull/2826

If you install from GitHub it should work, let me know if it doesn't.

We run the tutorial in CI, but I think this is a Databricks-specific issue, and we can't easily run CI against that environment.
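Installing the unreleased fix in a Databricks notebook might look like this (a sketch; the install command is my assumption, while the repository URL comes from the PR above):

```
# Install splink from the main branch to pick up the unreleased fix.
%pip install git+https://github.com/moj-analytical-services/splink.git
```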

RobinL avatar Nov 30 '25 21:11 RobinL

You can also use `break_lineage_method="parquet"` in your `SparkAPI` as a workaround until the fix is released.
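A minimal sketch of that workaround, assuming a Databricks notebook where `spark` is already defined:

```python
from splink import SparkAPI

# Break Spark's query lineage by writing intermediate tables out as
# parquet files rather than relying on registered views.
db_api = SparkAPI(
    spark_session=spark,
    break_lineage_method="parquet",
)
```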

aymonwuolanne avatar Nov 30 '25 21:11 aymonwuolanne

@RobinL thanks for the answer. Will check that.

@aymonwuolanne, I tried with parquet and delta-lake but it still wasn't working. Specifying the previous version when installing the library works.

Brice543 avatar Dec 01 '25 09:12 Brice543