Duke
Make set comparators that actually work on sets
That is, instead of working on the set of tokens in a string, the set comparators should work with multi-value properties and compare the sets of values for the property. This needs some fundamental changes in how properties are compared.
Hi,
What do you have in mind? Keeping the best score against the whole list?
Cheers, Yann
No, the idea is actually to compare the sets of values using Jaccard/Dice. Remember that Duke records can have multiple values for a single property. Thus, we can treat these as sets of values and compare the sets.
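To make the idea concrete, here is a minimal sketch of Jaccard and Dice applied to the value sets of one property. It is written in Python purely for illustration and is not Duke's API: the function names `jaccard` and `dice`, the example values, and the choice to score two empty sets as 1.0 are all assumptions.

```python
def jaccard(values1, values2):
    """Jaccard similarity: |A & B| / |A | B| over the property's value sets."""
    a, b = set(values1), set(values2)
    if not a and not b:
        return 1.0  # assumption: two empty value sets count as identical
    return len(a & b) / len(a | b)


def dice(values1, values2):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|) over the value sets."""
    a, b = set(values1), set(values2)
    if not a and not b:
        return 1.0  # assumption: same convention as in jaccard()
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical records, each with two values for a "name" property.
record1_names = ["Oslo", "Christiania"]
record2_names = ["Oslo", "Kristiania"]

print(jaccard(record1_names, record2_names))  # 1 shared value out of 3 distinct
print(dice(record1_names, record2_names))     # 2 * 1 / (2 + 2)
```

Note that this matches values exactly. "Christiania" and "Kristiania" count as different values here, so combining this with a per-value string comparator would be a separate design decision.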
Maybe this is overly simplistic, but couldn't we just change the split function to use a configurable separator (rather than defaulting to splitting on spaces)? So rather than splitting multi-values during the cleaning, it would be done during comparison.
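A rough sketch of what that alternative might look like, again in Python for illustration only. The names `split_values` and `jaccard_on_split` and the `separator` parameter are hypothetical, not existing Duke options. It assumes the multiple values arrive packed into a single string.

```python
def split_values(raw, separator=" "):
    """Split a raw string on the separator, dropping empty parts."""
    return {part.strip() for part in raw.split(separator) if part.strip()}


def jaccard_on_split(v1, v2, separator=" "):
    """Jaccard over the parts obtained by splitting at comparison time."""
    s1 = split_values(v1, separator)
    s2 = split_values(v2, separator)
    if not s1 and not s2:
        return 1.0  # assumption: two empty inputs count as identical
    return len(s1 & s2) / len(s1 | s2)


# Hypothetical values packed into one string with ";" between them.
left = "New York;Boston"
right = "Boston;New York"

print(jaccard_on_split(left, right, separator=";"))  # 1.0: same two values
print(jaccard_on_split(left, right))                 # 0.0: space tokens differ
```

One limitation of this approach is that it only works when the separator never occurs inside a value, whereas comparing the record's actual value sets needs no such convention.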