
Jaro-Winkler distance

Open gaetan-dion opened this issue 2 years ago • 1 comments

Hi,

In several projects, we would like to use the Jaro-Winkler distance.

This method is implemented in the Jellyfish library, and we think it would be valuable to add it to Vertica and/or VerticaPy.
The method is expensive to execute on a single node, because the calculation has to find all matches and transpositions between the two strings.
We know Vertica already has a Levenshtein distance, but Jaro-Winkler also gives good results. Furthermore, its result is normalized between 0 and 1, which makes comparison and interpretation easier.
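To make the request concrete, here is a minimal pure-Python sketch of the standard Jaro-Winkler computation (matches within a sliding window, then transpositions, then a common-prefix bonus). It is not the Jellyfish, Vertica, or VerticaPy implementation; the function names `jaro` and `jaro_winkler` and the parameters `prefix_weight` and `max_prefix` are assumptions chosen for illustration, using the commonly cited defaults of 0.1 and 4. Note that the value it returns is a similarity (1.0 means identical strings), so the corresponding distance would be one minus this value.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two characters match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order, counted in pairs.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_weight: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1.0 - base)


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The nested loop over the match window is what makes the calculation costly when it has to be run over many string pairs.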

Jaro-Winkler is used to compare two strings in several use cases:

  • Detecting duplicate values (such as mistyped names)
  • Replacing strings with normalized strings (such as company names), which makes it possible to join with external reference datasets such as INSEE
  • ....

gaetan-dion Jun 07 '22 07:06

Jaro-Winkler is on its way. It should be available soon.

oualib Aug 15 '22 16:08