splink icon indicating copy to clipboard operation
splink copied to clipboard

[FEAT] Pairwise array comparison levels

Open ADBond opened this issue 1 year ago • 4 comments

Is your proposal related to a problem?

We have the udf dualarrayexplode which gives us the Cartesian product of two arrays which we sometimes use for comparison levels. Currently we hand-roll the SQL, but it would be useful to have a shortcut for this.

Describe the solution you'd like

Would be nice to have a couple of ways to construct this - using string templates (or something similar), or from comparison levels.

For instance, pairwise levenshtein comparison level (minimum levenshtein distance between arrays less than 2)

# for each x in name_arr_l, y in name_arr_r, condition is min(levenshtein(x, y)) <= 2
cll.pairwise_array_level("name_arr", sql_snippet="levenshtein({col_l}, {col_r})", aggregate_fun="min", threshold=2)

or the same thing from a comparison level:

# for each x in name_arr_l, y in name_arr_r, condition is min(levenshtein(x, y)) <= 2
cll.pairwise_array_from_level(cll.levenshtein_level("name_arr", 2), aggregate_fun="min")

or for equality (if for instance we need to count instances so array_length_intersection is inappropriate):

cll.pairwise_array_level("name_arr", sql_snippet="{col_l} = {col_r}", aggregate_fun="any")
cll.pairwise_array_from_level(cll.exact_match_level("name_arr"))

Describe alternatives you've considered

Could have specialised versions - e.g. one for Levenshtein, one for equality, etc, but a suitably flexible option probably covers more use-cases without providing an overwhelming number of new levels.

Additional context

Might need some thought about range of use-cases + the best interface for this.

ADBond avatar Jun 14 '23 15:06 ADBond

Explore functionality in DuckDB

RossKen avatar Sep 19 '23 14:09 RossKen

@RossKen Since you mention DuckDB, have you considered something like my suggestion to use in iterated list_reduce in DuckDB in issue #1994? Is that approach viable?

samkodes avatar Feb 23 '24 02:02 samkodes

Yep it should be viable, see: https://github.com/moj-analytical-services/splink/issues/1994#issuecomment-1960938890

RobinL avatar Feb 23 '24 08:02 RobinL

See https://github.com/moj-analytical-services/splink/pull/2195

RobinL avatar Jul 24 '24 18:07 RobinL