splink [FEAT] Pairwise array comparison levels

Is your proposal related to a problem?

We have the udf dualarrayexplode which gives us the Cartesian product of two arrays which we sometimes use for comparison levels. Currently we hand-roll the SQL, but it would be useful to have a shortcut for this.

Describe the solution you'd like

Would be nice to have a couple of ways to construct this - using string templates (or something similar), or from comparison levels.

For instance, pairwise levenshtein comparison level (minimum levenshtein distance between arrays less than 2)

# for each x in name_arr_l, y in name_arr_r, condition is min(levenshtein(x, y)) <= 2
cll.pairwise_array_level("name_arr", sql_snippet="levenshtein({col_l}, {col_r})", aggregate_fun="min", threshold=2)

or the same thing from a comparison level:

# for each x in name_arr_l, y in name_arr_r, condition is min(levenshtein(x, y)) <= 2
cll.pairwise_array_from_level(cll.levenshtein_level("name_arr", 2), aggregate_fun="min")

or for equality (if for instance we need to count instances so array_length_intersection is inappropriate):

cll.pairwise_array_level("name_arr", sql_snippet="{col_l} = {col_r}", aggregate_fun="any")
cll.pairwise_array_from_level(cll.exact_match_level("name_arr"))

Describe alternatives you've considered

Could have specialised versions - e.g. one for Levenshtein, one for equality, etc, but a suitably flexible option probably covers more use-cases without providing an overwhelming number of new levels.

Additional context

Might need some thought about range of use-cases + the best interface for this.

Jun 14 '23 15:06 ADBond

Explore functionality in DuckDB

Sep 19 '23 14:09 RossKen

@RossKen Since you mention DuckDB, have you considered something like my suggestion to use in iterated list_reduce in DuckDB in issue #1994? Is that approach viable?

Feb 23 '24 02:02 samkodes

Yep it should be viable, see: https://github.com/moj-analytical-services/splink/issues/1994#issuecomment-1960938890

Feb 23 '24 08:02 RobinL

See https://github.com/moj-analytical-services/splink/pull/2195

Jul 24 '24 18:07 RobinL

splink splink copied to clipboard

[FEAT] Pairwise array comparison levels

Is your proposal related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

splink
splink copied to clipboard