clkhash
CLK hash: hash PII for entity matching
I propose removing the dependency on `bitarray` and using bitwise operations on `int`s instead. I see no good reason to use `bitarray`. The only two operations we use on it...
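To make the proposal concrete, here is a minimal sketch of a Bloom filter held in a plain Python `int`. The excerpt does not say which two operations are meant, so this assumes setting a bit and counting set bits; the helper names `set_bit`, `popcount`, and `to_bytes` are hypothetical and not part of clkhash.

```python
# Assumption: the filter only needs "set bit i" and "count set bits".
# All helper names here are illustrative, not clkhash API.

def set_bit(bits: int, index: int) -> int:
    """Return a copy of `bits` with bit `index` set to 1."""
    return bits | (1 << index)


def popcount(bits: int) -> int:
    """Count the 1 bits in a non-negative int."""
    return bin(bits).count("1")


def to_bytes(bits: int, length_bits: int) -> bytes:
    """Serialise the filter to a fixed-width little-endian byte string."""
    return bits.to_bytes((length_bits + 7) // 8, "little")


# Build a 1024-bit filter with three positions set.
bloom = 0
for position in (3, 17, 1023):
    bloom = set_bit(bloom, position)
```

Python `int`s are arbitrary precision, so a 1024-bit filter needs no special container, and `popcount(a & b)` gives the overlap that similarity measures such as the Dice coefficient rely on.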
Ideas:

- version of clkhash used
- size and statistics of CLKs
- schema (or hash of schema)
- hash of CLKs
- timestamp (when the PII was encoded)
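One possible shape for such a metadata record, as a rough sketch only. It assumes the CLKs are available as a list of serialised strings and the schema as a JSON-compatible dict; the function name `build_clk_metadata` and every field name are hypothetical. Per-CLK statistics are left out, since the list above does not say which ones are wanted.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_clk_metadata(clks, schema, clkhash_version):
    """Collect the metadata ideas listed above into one dict (hypothetical layout)."""
    # Hash a canonical JSON form of the schema so key order does not matter.
    schema_bytes = json.dumps(schema, sort_keys=True).encode("utf-8")

    # Hash the CLKs in order; reordering the list changes the digest.
    clk_digest = hashlib.sha256()
    for clk in clks:
        clk_digest.update(clk.encode("utf-8"))

    return {
        "clkhash_version": clkhash_version,
        "clk_count": len(clks),
        "schema_sha256": hashlib.sha256(schema_bytes).hexdigest(),
        "clks_sha256": clk_digest.hexdigest(),
        "encoded_at": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder inputs for illustration only.
meta = build_clk_metadata(["AAEC", "AwQF"], {"version": 1, "features": []}, "0.0.0")
```

Storing a hash of the schema rather than the schema itself keeps the record small while still letting two parties check that they encoded with the same configuration.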
While reading [Options for encoding names for data linking at the Australian Bureau of Statistics](https://arxiv.org/abs/1802.07975) I came across this note regarding restrictions on the Bloom filter's modulus:

Aha! Link: https://csiro.aha.io/features/ANONLINK-39
Users trying to use clkhash have run into issues with `head` and with the multiline commands separated with `/`