splink
splink copied to clipboard
[FEAT] Pairwise array comparison levels
Is your proposal related to a problem?
We have the udf dualarrayexplode
which gives us the Cartesian product of two arrays which we sometimes use for comparison levels. Currently we hand-roll the SQL, but it would be useful to have a shortcut for this.
Describe the solution you'd like
Would be nice to have a couple of ways to construct this - using string templates (or something similar), or from comparison levels.
For instance, pairwise levenshtein comparison level (minimum levenshtein distance between arrays less than 2)
# for each x in name_arr_l, y in name_arr_r, condition is min(levenshtein(x, y)) <= 2
cll.pairwise_array_level("name_arr", sql_snippet="levenshtein({col_l}, {col_r})", aggregate_fun="min", threshold=2)
or the same thing from a comparison level:
# for each x in name_arr_l, y in name_arr_r, condition is min(levenshtein(x, y)) <= 2
cll.pairwise_array_from_level(cll.levenshtein_level("name_arr", 2), aggregate_fun="min")
or for equality (if for instance we need to count instances so array_length_intersection
is inappropriate):
cll.pairwise_array_level("name_arr", sql_snippet="{col_l} = {col_r}", aggregate_fun="any")
cll.pairwise_array_from_level(cll.exact_match_level("name_arr"))
Describe alternatives you've considered
Could have specialised versions - e.g. one for Levenshtein, one for equality, etc, but a suitably flexible option probably covers more use-cases without providing an overwhelming number of new levels.
Additional context
Might need some thought about range of use-cases + the best interface for this.
Explore functionality in DuckDB
@RossKen Since you mention DuckDB, have you considered something like my suggestion to use in iterated list_reduce in DuckDB in issue #1994? Is that approach viable?
Yep it should be viable, see: https://github.com/moj-analytical-services/splink/issues/1994#issuecomment-1960938890
See https://github.com/moj-analytical-services/splink/pull/2195