Brian Thorne
Hypothesis thinks it has found a flaky test:

```
=================================== FAILURES ===================================
______________ test_bytes_bitarray_agree[dice_coefficient_python] ______________

sim_fun =

    @given(strategies.data(), strategies.floats(min_value=0, max_value=1))
>   @pytest.mark.parametrize('sim_fun', SIM_FUNS)
    def test_bytes_bitarray_agree(sim_fun, data, threshold):

/project/tests/test_similarity_dice.py:289:
_ _...
```
Optional interface with CUDA. Note that we have a proof of concept for computing the Sørensen-Dice index, sorting, and applying a threshold, all on the GPU. Need to consider whether to...
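For reference, the three steps named above (score, threshold, sort) can be sketched on the CPU in plain Python. This is only an illustrative sketch, not the GPU proof of concept: the function names `dice` and `candidates`, the use of Python integers as bit vectors, and the convention of returning 0.0 for two empty filters are all assumptions made for the example.

```python
def dice(a: int, b: int) -> float:
    """Sørensen-Dice coefficient of two bit vectors stored as Python ints."""
    total = bin(a).count("1") + bin(b).count("1")
    if total == 0:
        # Assumed convention: two empty filters score 0.0.
        return 0.0
    return 2 * bin(a & b).count("1") / total


def candidates(filters_a, filters_b, threshold):
    """Score every pair, keep those at or above the threshold, best first."""
    pairs = [
        (dice(x, y), i, j)
        for i, x in enumerate(filters_a)
        for j, y in enumerate(filters_b)
    ]
    return sorted((p for p in pairs if p[0] >= threshold), reverse=True)
```

Each result is a `(score, index_a, index_b)` tuple, so a caller can feed the sorted list straight into a matching step.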
Our greedy algorithm currently fails to match the following graph, where the connection between a and 1 looks likely but ultimately shouldn't be chosen. The network methods should succeed, and...
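The failure mode can be reproduced with a small sketch. The edge weights below are hypothetical (the graph from the issue is not reproduced here), and `best_match` is a brute-force search used only as a stand-in for an optimal solver; it is not the network method the issue refers to.

```python
from itertools import permutations

# Hypothetical similarity scores between records {a, b} and {1, 2}.
edges = {("a", "1"): 0.9, ("a", "2"): 0.8, ("b", "1"): 0.8}


def greedy_match(edges):
    """Take edges in descending score order, skipping already-used nodes."""
    used_left, used_right, result = set(), set(), {}
    for (left, right), _score in sorted(edges.items(), key=lambda kv: -kv[1]):
        if left not in used_left and right not in used_right:
            result[left] = right
            used_left.add(left)
            used_right.add(right)
    return result


def best_match(edges):
    """Brute-force the assignment with the highest total score.

    Assumes there are at least as many right nodes as left nodes.
    """
    lefts = sorted({left for left, _ in edges})
    rights = sorted({right for _, right in edges})
    best, best_score = {}, -1.0
    for perm in permutations(rights, len(lefts)):
        pairs = [(l, r) for l, r in zip(lefts, perm) if (l, r) in edges]
        score = sum(edges[p] for p in pairs)
        if score > best_score:
            best, best_score = dict(pairs), score
    return best
```

With these weights the greedy pass commits to a-1 first and leaves b unmatched, while the exhaustive search pairs a with 2 and b with 1.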
It may make sense to calculate multiple CLKs using different field sets for improved matching, blocking, matching with orgs who only have a subset of the fields, and most importantly...
Consider adding a `train` function that would be provided with training data - CLKs that are known to match. The output would be an optimal threshold `t`. This would be...
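A minimal sketch of what such a function could look like, under several assumptions not stated in the issue: the training data is supplied as similarity scores for candidate pairs plus a boolean label per pair (so known non-matches are available too), "optimal" means maximising the F1 score, and the name `train_threshold` is hypothetical.

```python
def train_threshold(scores, labels):
    """Return the threshold t that maximises F1 on labelled pair scores.

    scores: similarity score per candidate pair
    labels: True if the pair is a known match, else False
    """
    best_t, best_f1 = 0.0, -1.0
    # Only observed scores need to be tried as candidate thresholds.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, f1
    return best_t
```

Other objectives (for example a fixed precision target) would only change the scoring line inside the loop.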
Yangfeng suggested looking at [febrl](https://github.com/fgregg/febrl) to generate data with perturbations.

Manual - http://users.cecs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/

Additional test sets:
- http://www.record-linkage.de/-Resources--other_record_linkage_resources.htm#recordlinkagetestdata
- https://espace.curtin.edu.au/handle/20.500.11937/26908

Aha! Link: https://csiro.aha.io/features/ANONLINK-76
Run Memcheck on the binary and fail if it detects a memory leak or bad access. Aha! Link: https://csiro.aha.io/features/ANONLINK-75
Currently a few e2e tests rely on clkhash for generating the Bloom filters. While it is nice to do an integration test between the related libraries, we shouldn't introduce a clkhash...
When creating the bi-grams, the first and last bi-gram are padded with a whitespace. This is a weakness, because it allows an attacker to more easily find the beginning...
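To make the padding concrete, here is a small illustrative tokenizer. The `bigrams` function and its `pad` flag are assumptions for the example, not the library's actual API.

```python
def bigrams(word, pad=True):
    """Split a word into overlapping two-character tokens.

    With pad=True a single space is added on each side, so the first and
    last tokens contain whitespace and mark the word boundaries.
    """
    s = f" {word} " if pad else word
    return [s[i:i + 2] for i in range(len(s) - 1)]
```

For the word `anna`, padding yields `' a'` and `'a '` as the first and last tokens, which only ever occur at word boundaries; without padding the tokens are `an`, `nn`, `na`.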
For testing purposes it would be useful to have a compatible jar of this library, built to use the javallier mock context.