clkhash
CLK hash: hash pii for entity matching
There's no reason not to allow a field to be processed more than once with different tokenization and hashing. V2 schema can represent this, but current code can't handle it....
It would be a good idea to make it clear how to use the library without serialization, e.g. to directly use the clkhash output with anonlink. There are a few functions...
Say a row doesn't have data for one field:

```
INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
0,Libby Slemmer,1933/09/13,F
1,Garold Staten,,M
2,Yaritza Edman,1972/11/30,
```

What should we do? 1) Current approach...
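To make the missing-field question concrete, here is a minimal sketch that only detects which rows have empty fields in the sample above. It uses the standard `csv` module rather than clkhash itself, and the helper name `find_missing_fields` is hypothetical. It does not decide how such rows should be hashed.

```python
import csv
import io

# Sample rows copied from the issue text above.
SAMPLE = """INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
0,Libby Slemmer,1933/09/13,F
1,Garold Staten,,M
2,Yaritza Edman,1972/11/30,
"""


def find_missing_fields(csv_text):
    """Return {row index: [names of empty columns]} for rows with gaps.

    Hypothetical helper for illustration; not part of clkhash.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    missing = {}
    for row in reader:
        # A field counts as missing when it is empty after stripping whitespace.
        empty = [name for name, value in zip(header, row) if value.strip() == ""]
        if empty:
            missing[row[0]] = empty
    return missing


missing = find_missing_fields(SAMPLE)
print(missing)
```

Reporting the gaps separately from hashing keeps the policy choice (skip the field, reject the row, or encode a placeholder) open.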
Aha! Link: https://csiro.aha.io/features/ANONLINK-50
We need to add a note on security... The Cryptographic Longterm Key is computed and compared following the method described by Rainer Schnell, Tobias Bachteler, and Jörg Reiher in [A...
Currently, the defaults are embedded in the code. This is in addition to them being listed in the master schema. This can lead to inconsistencies if the defaults are changed...
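One way to avoid the duplication is to read the defaults from the schema at runtime instead of repeating them in code. The sketch below assumes a JSON-Schema-like layout with a `default` key per property. The schema fragment, the property names, and the helper functions are all hypothetical and do not reflect the actual clkhash master schema.

```python
import json

# Hypothetical schema fragment; the real master schema may be structured differently.
MASTER_SCHEMA = json.loads("""
{
  "properties": {
    "ngram": {"type": "integer", "default": 2},
    "positional": {"type": "boolean", "default": false}
  }
}
""")


def defaults_from_schema(schema):
    """Collect every property's default so the schema is the single source of truth."""
    return {
        name: spec["default"]
        for name, spec in schema["properties"].items()
        if "default" in spec
    }


def with_defaults(user_config, schema=MASTER_SCHEMA):
    """Fill in any value the user did not set, using the schema defaults."""
    merged = defaults_from_schema(schema)
    merged.update(user_config)
    return merged


config = with_defaults({"ngram": 3})
print(config)
```

With this arrangement, changing a default in the schema changes the behaviour of the code without a second edit.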
In the literature, the length of a CLK _l_ is fixed to either 1000 or 100, depending on who is writing the paper. I read somewhere (unfortunately I cannot find it...
> Assuming clk.py is meant to be common code that could support a number of different interfaces, then tqdm’s progress bars (which are specific to a CLI) should be handled...
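A common way to decouple progress reporting from the core code is to accept an optional callback and let each interface decide how to display it. The sketch below is an assumption about how that could look, not the current clkhash API. `hash_records`, its parameters, and the stand-in "hashing" step are hypothetical.

```python
from typing import Callable, Iterable, List, Optional


def hash_records(
    records: Iterable[str],
    callback: Optional[Callable[[int], None]] = None,
    chunk_size: int = 2,
) -> List[str]:
    """Process records and report progress through an optional callback.

    The callback receives the number of records completed since the last call,
    so a CLI can forward it to a progress bar and a library user can ignore it.
    """
    results: List[str] = []
    pending = 0
    for record in records:
        results.append(record.upper())  # stand-in for the real hashing step
        pending += 1
        if pending == chunk_size:
            if callback is not None:
                callback(pending)
            pending = 0
    # Report any records left over after the last full chunk.
    if pending and callback is not None:
        callback(pending)
    return results


updates: List[int] = []
hashed = hash_records(["a", "b", "c", "d", "e"], callback=updates.append)
print(hashed, updates)
```

A CLI could pass the `update` method of a tqdm progress bar as the callback, while other interfaces pass nothing or their own reporter.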
The readme or docs should state how we configure and run mypy. Aha! Link: https://csiro.aha.io/features/ANONLINK-37
The error you will get looks like this:

```
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
in ()
      1 from clkhash import clk
----> 2 hashed_data_a = clk.generate_clk_from_csv(a_csv, ('key1', 'key2'),...
```