cluster_pairwise_predictions_at_threshold is crashing on Databricks in v4.0.11
What happens?
Since the new version, `linker.clustering.cluster_pairwise_predictions_at_threshold()` crashes at the end of the process due to missing tables. The error reports that the table "filtered_neighbours" does not exist in Databricks.
I checked, and it seems to be because the drop command is executed twice in connected_components.py. Maybe running `DROP TABLE IF EXISTS` would guard against this behaviour.
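For illustration, a minimal sketch of the suggested guard (the table name is simplified from the error message; splink generates its intermediate table names internally):

```python
# Hypothetical guard: with IF EXISTS, a duplicate drop is a no-op
# rather than a "table does not exist" error.
spark.sql("DROP TABLE IF EXISTS filtered_neighbours")
```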
To Reproduce
- Install the latest version of the library on Databricks.
- Run a standard pipeline up to the clustering step.
- Run `linker.clustering.cluster_pairwise_predictions_at_threshold()` ==> crashes due to missing tables (see the sketch below).
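A minimal end-to-end sketch of these steps, assuming a Databricks notebook with an active `spark` session; the settings are illustrative, not the exact tutorial ones:

```python
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, SparkAPI, block_on, splink_datasets

df = spark.createDataFrame(splink_datasets.fake_1000)

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
    ],
    blocking_rules_to_generate_predictions=[block_on("first_name", "surname")],
)

linker = Linker(df, settings, db_api=SparkAPI(spark_session=spark))
# ... model training steps omitted for brevity; see the tutorial ...
predictions = linker.inference.predict(threshold_match_probability=0.9)

# On v4.0.11 this step fails, reporting that "filtered_neighbours"
# does not exist:
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    predictions, threshold_match_probability=0.95
)
```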
OS:
Databricks DBR
Splink version:
4.0.11
Have you tried this on the latest master branch?
- [x] I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [x] I agree
I'm currently evaluating whether to use this library for a production use case and am getting the same error. Running the tutorial code seems like a basic test that should be done before release. As a consequence, I don't think we can use this library. Are you open to contributors submitting PRs?
Hello shaunryan,
Just saw your message. To solve your issue, try defining your default catalog and schema at the beginning of your notebook with `%sql USE CATALOG xxxxx; USE SCHEMA xxxxx`.
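The same settings from Python, if you prefer (the catalog and schema names are placeholders):

```python
# Set the default catalog and schema for the session
# ("xxxxx" stands in for your own names).
spark.sql("USE CATALOG xxxxx")
spark.sql("USE SCHEMA xxxxx")
```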
Otherwise, just pin the previous version of the library; that works.
I think this is fixed, just not released yet: https://github.com/moj-analytical-services/splink/pull/2826
If you install from GitHub it should work, let me know if it doesn't.
We run the tutorial in CI, but I think this is a Databricks-specific issue, and we can't easily run CI against that environment.
You can also use `break_lineage_method="parquet"` in your SparkAPI as a workaround until the fix is released (sketch below).
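A minimal sketch of that configuration, assuming an active `spark` session:

```python
from splink import SparkAPI

# Write intermediate results to parquet to break Spark's lineage,
# rather than relying on registered tables.
db_api = SparkAPI(
    spark_session=spark,
    break_lineage_method="parquet",
)
```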
@RobinL thanks for the answer. Will check that.
@aymonwuolanne, I tried with parquet and delta lake, but it was still not working. Specifying the previous version when installing the library works.
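For reference, pinning in a Databricks notebook looks like this (assuming 4.0.10 is the release immediately before the regression; adjust as needed):

```
%pip install splink==4.0.10
```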