clkhash
Handle hashing missing values
Say a row doesn't have data for one field:
INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
0,Libby Slemmer,1933/09/13,F
1,Garold Staten,,M
2,Yaritza Edman,1972/11/30,
What should we do?
- The current approach still creates a CLK for the record: it either hashes an empty string or skips that feature, so fewer bits get set and the record might not be considered a match.
- We could drop the row and locally output a list of entities that were dropped.
- We could throw an error and leave it up to the user.
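
To make the last two options concrete, here is a minimal sketch in plain Python. It does not use any clkhash API; the function name `split_missing` and the `policy` argument are made up for illustration, and it assumes the first column is the entity identifier and that an empty or whitespace-only field counts as missing.

```python
import csv
import io


def split_missing(csv_text, policy="drop"):
    """Separate complete rows from rows that have an empty field.

    policy="drop":  return (complete_rows, dropped_ids)
    policy="error": raise ValueError on the first incomplete row

    Hypothetical helper for discussion, not part of clkhash.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    complete, dropped = [], []
    for row in reader:
        # Names of the schema fields that are empty in this row.
        missing = [name for name, value in zip(header, row) if not value.strip()]
        if not missing:
            complete.append(row)
        elif policy == "error":
            raise ValueError(f"row {row[0]} is missing: {', '.join(missing)}")
        else:
            # Assumption: the first column identifies the entity.
            dropped.append(row[0])
    return complete, dropped


SAMPLE = """INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
0,Libby Slemmer,1933/09/13,F
1,Garold Staten,,M
2,Yaritza Edman,1972/11/30,
"""

complete, dropped = split_missing(SAMPLE)
print("hashable rows:", complete)
print("dropped entities:", dropped)
```

With `policy="drop"` the list of dropped identifiers stays local and can be written out for the user; with `policy="error"` the first incomplete row stops processing.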
In any case I think we should decide what approach is best and document our decision in the docs.
Aha! Link: https://csiro.aha.io/features/ANONLINK-55
We could have a bit mask that shows which of the fields were present (relative to the input schema), so that subsequent processing knows the match probabilities need to be interpreted differently. Alternatively, we could return, along with the probability, how many parts of the schema were not matched.
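
A rough sketch of the bit mask idea, again in plain Python rather than anything clkhash provides. The names `presence_mask` and `missing_count` are hypothetical, and the bit order (bit 0 for the first schema field) is an assumption.

```python
def presence_mask(row):
    """Return an int where bit i is set when field i of the row is non-empty."""
    mask = 0
    for i, value in enumerate(row):
        if value.strip():
            mask |= 1 << i
    return mask


def missing_count(mask_a, mask_b, n_fields):
    """Count schema fields that are absent from at least one of two records."""
    present_in_both = mask_a & mask_b
    return n_fields - bin(present_in_both).count("1")


# Fields in schema order: NAME, DOB, GENDER (INDEX is left out).
rows = [
    ["Libby Slemmer", "1933/09/13", "F"],
    ["Garold Staten", "", "M"],
    ["Yaritza Edman", "1972/11/30", ""],
]
masks = [presence_mask(row) for row in rows]
print([format(m, "03b") for m in masks])
print(missing_count(masks[1], masks[2], n_fields=3))
```

The count from `missing_count` could be returned alongside the similarity score so that a low score caused by missing fields can be told apart from a genuine mismatch. Note that sharing such a mask reveals which fields a record lacks, which would need to be weighed against the privacy goals of the CLK.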