clkhash

Handle hashing missing values

Open: hardbyte opened this issue 6 years ago • 1 comment

Say a row doesn't have data for one field:

INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
0,Libby Slemmer,1933/09/13,F
1,Garold Staten,,M
2,Yaritza Edman,1972/11/30,

What should we do?

  1. The current approach still creates a CLK for the record. It either hashes an empty string or skips that feature, so fewer bits get set and the record might not be considered a match.
  2. We could drop the row and locally output a list of the entities that were dropped.
  3. We could throw an error and leave it up to the user.
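To make the three options concrete, here is a minimal sketch that only uses the Python standard library. It does not call clkhash; the names `missing_fields`, `apply_policy`, `MissingValueError` and the policy strings are hypothetical and exist purely for illustration. It treats an empty or whitespace-only value as missing, which is itself an assumption.

```python
import csv
import io

# The sample rows from this issue.
SAMPLE = """INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
0,Libby Slemmer,1933/09/13,F
1,Garold Staten,,M
2,Yaritza Edman,1972/11/30,
"""

POLICIES = ("keep", "drop", "error")


class MissingValueError(ValueError):
    """Hypothetical error type used for option 3."""


def missing_fields(row):
    """Return the names of fields whose value is empty or whitespace."""
    return [name for name, value in row.items() if not (value or "").strip()]


def apply_policy(rows, policy):
    """Split rows into (kept, dropped) according to a missing-value policy.

    "keep"  -> option 1: every row is kept and hashed as it is.
    "drop"  -> option 2: incomplete rows are dropped and reported locally.
    "error" -> option 3: the first incomplete row raises an error.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    kept, dropped = [], []
    for row in rows:
        missing = missing_fields(row)
        if not missing or policy == "keep":
            kept.append(row)
        elif policy == "drop":
            dropped.append((row["INDEX"], missing))
        else:
            raise MissingValueError(f"row {row['INDEX']} is missing {missing}")
    return kept, dropped


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
kept, dropped = apply_policy(rows, "drop")
for index, missing in dropped:
    print(f"dropped row {index}: missing {missing}")
```

With the `"drop"` policy only row 0 would go on to be hashed, and rows 1 and 2 would appear in the local report. Whichever option is chosen, a single switch like this keeps the behaviour explicit and easy to document.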

In any case I think we should decide what approach is best and document our decision in the docs.

Aha! Link: https://csiro.aha.io/features/ANONLINK-55

hardbyte, Jul 12 '17 00:07

We could have a bit mask that shows which of the fields were present (relative to the input schema), so that subsequent processing would know that the match probabilities need to be interpreted differently. Alternatively, we could return, along with the probability, how many parts of the schema were not matched.
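A rough sketch of the bit mask idea, assuming one bit per schema field in schema order. The functions `presence_mask` and `fields_not_compared` are hypothetical names, not part of clkhash, and the field list is taken from the sample header in this issue.

```python
# Field names from the sample header, in schema order (INDEX is not hashed).
SCHEMA = ["NAME freetext", "DOB YYYY/MM/DD", "GENDER M or F"]


def presence_mask(row, schema=SCHEMA):
    """Return an int whose bit i is set when schema field i has a value."""
    mask = 0
    for position, name in enumerate(schema):
        if (row.get(name) or "").strip():
            mask |= 1 << position
    return mask


def fields_not_compared(mask_a, mask_b, schema=SCHEMA):
    """List the fields missing from at least one of the two records."""
    both = mask_a & mask_b
    return [name for i, name in enumerate(schema) if not ((both >> i) & 1)]


garold = {"NAME freetext": "Garold Staten", "DOB YYYY/MM/DD": "", "GENDER M or F": "M"}
yaritza = {"NAME freetext": "Yaritza Edman", "DOB YYYY/MM/DD": "1972/11/30", "GENDER M or F": ""}

mask_g = presence_mask(garold)    # name and gender present
mask_y = presence_mask(yaritza)   # name and date of birth present
print(bin(mask_g), bin(mask_y), fields_not_compared(mask_g, mask_y))
```

The mask would travel alongside the CLK, so it reveals which fields a record lacks. Whether that leakage is acceptable is a separate question this sketch does not answer.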

sjhardy, Oct 11 '17 04:10