cluster_pairwise_predictions_at_threshold is crashing on Databricks in v4.0.11
What happens?
Since the new version, `linker.clustering.cluster_pairwise_predictions_at_threshold()` crashes at the end of the process due to missing tables. The error reports that the table "filtered_neighbours" does not exist in Databricks.
I checked, and it seems to be because the drop command is executed twice in connected_components.py. Maybe running `DROP TABLE IF EXISTS` would guard against this behaviour.
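For illustration, a minimal sketch of the suggested guard (the table name is simplified from the error message; splink generates its intermediate table names internally):

```python
# Hypothetical guard: with IF EXISTS, a duplicate drop is a no-op
# rather than a "table does not exist" error.
spark.sql("DROP TABLE IF EXISTS filtered_neighbours")
```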
To Reproduce
- Install the latest version of the library on Databricks.
- Run a standard pipeline up to the clustering step.
- Run `linker.clustering.cluster_pairwise_predictions_at_threshold()` ==> crashes due to missing tables (see the sketch below).
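A minimal end-to-end sketch of these steps, assuming a Databricks notebook with an active `spark` session; the settings are illustrative, not the exact tutorial ones:

```python
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, SparkAPI, block_on, splink_datasets

df = spark.createDataFrame(splink_datasets.fake_1000)

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
    ],
    blocking_rules_to_generate_predictions=[block_on("first_name", "surname")],
)

linker = Linker(df, settings, db_api=SparkAPI(spark_session=spark))
# ... model training steps omitted for brevity; see the tutorial ...
predictions = linker.inference.predict(threshold_match_probability=0.9)

# On v4.0.11 this step fails, reporting that "filtered_neighbours"
# does not exist:
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    predictions, threshold_match_probability=0.95
)
```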
OS:
Databricks DBR
Splink version:
4.0.11
Have you tried this on the latest master branch?
- [x] I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [x] I agree
I'm currently evaluating whether to use this library for a production use case and am getting the same error. Running the tutorial code seems like a basic test that should be done before release. As a consequence, I don't think we can use this library. Are you open to contributors submitting PRs?
Hello shaunryan,
Just saw your message. To solve your issue, try defining your default catalog and schema at the beginning of your notebook with `%sql USE CATALOG xxxxx; USE SCHEMA xxxxx`.
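The same settings from Python, if you prefer (the catalog and schema names are placeholders):

```python
# Set the default catalog and schema for the session
# ("xxxxx" stands in for your own names).
spark.sql("USE CATALOG xxxxx")
spark.sql("USE SCHEMA xxxxx")
```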
Otherwise, just pin the previous version of the library; that works.
I think this is fixed, just not released yet: https://github.com/moj-analytical-services/splink/pull/2826
If you install from GitHub it should work, let me know if it doesn't.
We run the tutorial in CI, but I think this is a Databricks-specific issue, and we can't easily run CI against that environment.
You can also use `break_lineage_method="parquet"` in your SparkAPI as a workaround until the fix is released (sketch below).
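A minimal sketch of that configuration, assuming an active `spark` session:

```python
from splink import SparkAPI

# Write intermediate results to parquet to break Spark's lineage,
# rather than relying on registered tables.
db_api = SparkAPI(
    spark_session=spark,
    break_lineage_method="parquet",
)
```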
@RobinL thanks for the answer. Will check that.
@aymonwuolanne, I tried with parquet and delta lake, but it was still not working. Specifying the previous version when installing the library works.
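For reference, pinning in a Databricks notebook looks like this (assuming 4.0.10 is the release immediately before the regression; adjust as needed):

```
%pip install splink==4.0.10
```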