Make set comparators that actually work on sets

Open larsga opened this issue 10 years ago • 3 comments

That is, instead of working on the set of tokens in a single string, the comparators should work with multi-value properties and compare the sets of values the two records have for that property. This needs some fundamental changes in how properties are compared.

larsga avatar Aug 01 '14 06:08 larsga

Hi,

What do you have in mind? Keep the best score against the whole list?

Cheers, Yann

YannBrrd avatar Aug 01 '14 15:08 YannBrrd

No, the idea is actually to compare the sets of values using Jaccard/Dice. Remember that Duke records can have multiple values for a single property. Thus, we can treat these as sets of values and compare the sets.
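
To make the proposal concrete, here is a minimal sketch of Jaccard and Dice computed over the values of a multi-value property rather than over the tokens of one string. The class `SetSimilarity` and its method names are hypothetical and not part of Duke's API, and scoring two empty value sets as 1.0 is an assumption made only for illustration.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Duke: compares two collections of
// property values as sets.
final class SetSimilarity {

    // Number of distinct values present in both collections.
    private static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // Jaccard index: |A ∩ B| / |A ∪ B|
    static double jaccard(Collection<String> v1, Collection<String> v2) {
        Set<String> a = new HashSet<>(v1);
        Set<String> b = new HashSet<>(v2);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two empty sets count as identical
        }
        int common = intersectionSize(a, b);
        int union = a.size() + b.size() - common;
        return (double) common / union;
    }

    // Dice coefficient: 2 |A ∩ B| / (|A| + |B|)
    static double dice(Collection<String> v1, Collection<String> v2) {
        Set<String> a = new HashSet<>(v1);
        Set<String> b = new HashSet<>(v2);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two empty sets count as identical
        }
        int common = intersectionSize(a, b);
        return 2.0 * common / (a.size() + b.size());
    }
}
```

For example, if one record has the values Oslo, Bergen and Trondheim for a property and the other has Oslo, Bergen and Stavanger, two values are shared out of four distinct ones, so Jaccard gives 2/4 = 0.5 and Dice gives 4/6 ≈ 0.67.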

larsga avatar Aug 01 '14 16:08 larsga

Maybe this is overly simplistic, but couldn't we just change the split function to use a configurable split-on value (rather than default to splitting on space)? So rather than splitting multi-values during the cleaning, it is done during comparison.
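
A rough sketch of what this alternative could look like: the multi-value string is kept intact through cleaning and only split on a configurable delimiter when two values are compared. The class `DelimitedSetComparator` is hypothetical and does not implement any Duke interface. Using Jaccard for the score and trimming whitespace around each part are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical comparator, not part of Duke: splits each value on a
// configurable delimiter at comparison time and scores the resulting sets.
final class DelimitedSetComparator {
    private final String delimiter;

    DelimitedSetComparator(String delimiter) {
        this.delimiter = delimiter;
    }

    // Split on the literal delimiter, trim each part, and drop empty parts.
    Set<String> split(String value) {
        Set<String> parts = new HashSet<>();
        // Pattern.quote makes delimiters such as "|" or "." match literally.
        for (String part : value.split(Pattern.quote(delimiter))) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    // Jaccard index over the two sets of parts.
    double compare(String v1, String v2) {
        Set<String> a = split(v1);
        Set<String> b = split(v2);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two empty sets count as identical
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return (double) common.size() / union;
    }
}
```

With a delimiter of `";"`, the strings `"Oslo; Bergen"` and `"Bergen;Oslo"` split into the same two values and score 1.0, regardless of order.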

ztsmith avatar Aug 11 '14 04:08 ztsmith