clkhash
CLK hash: hash PII for entity matching
I propose removing the dependency on Bitarray and using bitwise operations on `int`s instead. I see no good reason to use `bitarray`. The only two operations we use on it...
Ideas:
- version of clkhash used
- size and statistics of clks
- schema (or hash of schema)
- hash of clks
- timestamp (when the PII was encoded)
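One way the ideas listed here could be gathered into a single metadata record is sketched below. The function name `build_clk_metadata`, the field names, the choice of SHA-256, and the assumption that each CLK is available as a `bytes` object are all illustrative and do not come from clkhash. The version string is passed in by the caller rather than read from the library.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_clk_metadata(clks, schema, clkhash_version):
    """Assemble a metadata dict for a batch of CLKs (hypothetical layout).

    clks: list of bytes objects, one per encoded record (assumed format).
    schema: a JSON-serializable dict describing the linkage schema.
    clkhash_version: version string supplied by the caller.
    """
    # Hash a canonical JSON form so key order does not change the digest.
    schema_bytes = json.dumps(schema, sort_keys=True).encode("utf-8")

    # Hash all CLKs in order into a single digest.
    clk_digest = hashlib.sha256()
    for clk in clks:
        clk_digest.update(clk)

    return {
        "clkhash_version": clkhash_version,
        "clk_count": len(clks),
        "clk_size_bytes": len(clks[0]) if clks else 0,
        "schema_sha256": hashlib.sha256(schema_bytes).hexdigest(),
        "clks_sha256": clk_digest.hexdigest(),
        "encoded_at": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder inputs, not real CLKs or a real schema.
example = build_clk_metadata(
    [b"\x00" * 128, b"\xff" * 128],
    {"version": 1, "features": []},
    "0.0.0-example",
)
print(json.dumps(example, indent=2))
```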
While reading [Options for encoding names for data linking at the Australian Bureau of Statistics](https://arxiv.org/abs/1802.07975) I came across this note regarding restrictions on the bloom filter's modulus: ![screenshot from 2018-02-24...
Consider if the right levels of abstraction have been made for a library user and document options to improve. It should be relatively easy for a clkhash user to define...
An experimental API has been added for uploading CLKs as a binary file. This is to allow for faster and more efficient data transfer. The same REST endpoint (`/projects/{project_id}/clks`) is...
Add a page to the docs with information about supported platforms including any special instructions on how to install dependencies e.g. Visual Studio C++ compiler on Windows. Perhaps worth looking...
We should rethink defaults, as currently:
* `clkhash` ignores the values in the spec
* the defaults are spread throughout the code base. Either hard-coded (e.g. schema.py line 184), default...
Consider applying [black](https://github.com/ambv/black) Aha! Link: https://csiro.aha.io/features/ANONLINK-39
Users trying to use clkhash have run into issues with `head` and with the multiline commands separated with `/`