Brian Thorne

Results 125 issues of Brian Thorne

Hypothesis thinks it has found a flaky test:
```
=================================== FAILURES ===================================
______________ test_bytes_bitarray_agree[dice_coefficient_python] ______________
sim_fun =
    @given(strategies.data(), strategies.floats(min_value=0, max_value=1))
>   @pytest.mark.parametrize('sim_fun', SIM_FUNS)
    def test_bytes_bitarray_agree(sim_fun, data, threshold):
/project/tests/test_similarity_dice.py:289:
_ _...
```

state: Need more information
P5: low

Optional interface with CUDA. Note we have a proof of concept for computing the Sørensen-Dice index, sorting and applying a threshold all on the GPU. Need to consider whether to...
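For reference, the three steps named above (Dice computation, thresholding, sorting) can be sketched on the CPU in plain Python. This is not the GPU proof of concept; it assumes each Bloom filter is held as a Python integer, and the names `dice` and `candidate_pairs` are hypothetical.

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def dice(a: int, b: int) -> float:
    """Sørensen-Dice coefficient of two bit vectors stored as integers."""
    total = popcount(a) + popcount(b)
    if total == 0:
        return 0.0  # convention chosen here for two empty filters
    return 2 * popcount(a & b) / total


def candidate_pairs(filters_a, filters_b, threshold):
    """Score every pair, keep those at or above the threshold, sort descending."""
    scored = [
        (dice(fa, fb), i, j)
        for i, fa in enumerate(filters_a)
        for j, fb in enumerate(filters_b)
    ]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


pairs = candidate_pairs([0b1111, 0b1100], [0b1111, 0b0011], threshold=0.5)
```

A GPU version would parallelise the pairwise scoring; the all-pairs loop here is quadratic and only meant to pin down the expected result.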

enhancement

Our greedy algorithm currently fails to match the following graph correctly: the connection between a and 1 looks likely, but ultimately shouldn't be chosen. ![4048684e-3882-11e6-9a81-105da6c927bd](https://user-images.githubusercontent.com/855189/36401446-a28157e0-162b-11e8-8c09-2217f51ec030.png) The network methods should succeed, and...
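A minimal sketch of this kind of failure, using made-up similarity scores rather than the graph in the image: greedy selection takes the strongest edge (a, 1) first and leaves b unmatched, while an exhaustive search finds the better assignment. The brute-force search stands in for a network-based solver here and is only practical for tiny inputs; all names are hypothetical.

```python
from itertools import permutations

# Hypothetical scores; (b, 2) has no edge at all.
SCORES = {("a", "1"): 0.9, ("a", "2"): 0.8, ("b", "1"): 0.8}


def greedy_matching(scores):
    """Take edges in descending score order, skipping already-used nodes."""
    used_left, used_right, result = set(), set(), []
    for (left, right), _ in sorted(
        scores.items(), key=lambda item: item[1], reverse=True
    ):
        if left not in used_left and right not in used_right:
            result.append((left, right))
            used_left.add(left)
            used_right.add(right)
    return result


def best_matching(scores):
    """Brute-force the one-to-one assignment with the highest total score."""
    lefts = sorted({left for left, _ in scores})
    rights = sorted({right for _, right in scores})
    # Pad with None so every left node can also stay unmatched.
    padded = rights + [None] * max(0, len(lefts) - len(rights))
    best, best_total = [], 0.0
    for perm in permutations(padded, len(lefts)):
        pairs = [(l, r) for l, r in zip(lefts, perm) if (l, r) in scores]
        total = sum(scores[p] for p in pairs)
        if total > best_total:
            best, best_total = pairs, total
    return best


greedy = greedy_matching(SCORES)
optimal = best_matching(SCORES)
```

Here `greedy` contains only the pair (a, 1), whereas `optimal` pairs a with 2 and b with 1.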

It may make sense to calculate multiple CLKs using different field sets for improved matching, blocking, matching with orgs who only have a subset of the fields, and most importantly...

proposal

Consider adding a `train` function that would be provided with training data - CLKs that are known to match. The output would be an optimal threshold `t`. This would be...
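One possible shape for such a function, as a sketch only: it assumes similarity scores have already been computed for labelled pairs (including some known non-matches), and it takes "optimal" to mean the threshold that maximises F1. Both the input format and the F1 criterion are assumptions, not part of the proposal.

```python
def train(scored_pairs):
    """Return the threshold t that maximises F1 on labelled pairs.

    scored_pairs: iterable of (similarity, is_match) tuples.
    A pair is predicted to match when similarity >= t.
    """
    pairs = list(scored_pairs)
    total_matches = sum(1 for _, is_match in pairs if is_match)
    best_t, best_f1 = 1.0, 0.0
    # Only observed scores can change the predictions, so try each one.
    for t in sorted({score for score, _ in pairs}, reverse=True):
        tp = sum(1 for score, is_match in pairs if score >= t and is_match)
        fp = sum(1 for score, is_match in pairs if score >= t and not is_match)
        fn = total_matches - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t


labelled = [(0.95, True), (0.9, True), (0.7, False), (0.6, True), (0.3, False)]
t = train(labelled)
```

On this toy data the F1 values at thresholds 0.95, 0.9, 0.7, 0.6 and 0.3 are 0.5, 0.8, 0.667, 0.857 and 0.75, so `t` is 0.6.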

enhancement

Yangfeng suggested looking at [febrl](https://github.com/fgregg/febrl) to generate data with perturbations.
Manual: http://users.cecs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/
Additional test sets:
- http://www.record-linkage.de/-Resources--other_record_linkage_resources.htm#recordlinkagetestdata
- https://espace.curtin.edu.au/handle/20.500.11937/26908
Aha! Link: https://csiro.aha.io/features/ANONLINK-76

Run Memcheck on the binary and fail if it detects a memory leak or bad access. Aha! Link: https://csiro.aha.io/features/ANONLINK-75
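A sketch of how such a check could be wired into a test script, assuming Valgrind is installed on the build machine. The flags are standard Valgrind options: `--leak-check=full` makes definite and possible leaks count as errors, and `--error-exitcode=1` makes Valgrind exit non-zero when any error is reported. The function names and the binary path are placeholders.

```python
import subprocess


def memcheck_command(binary, *args):
    """Build the Valgrind invocation; the binary path is a placeholder."""
    return [
        "valgrind",
        "--tool=memcheck",
        "--leak-check=full",    # report leaks as errors
        "--error-exitcode=1",   # non-zero exit if Memcheck reports anything
        binary,
        *args,
    ]


def run_memcheck(binary, *args):
    """Run the binary under Memcheck and raise if it reports errors."""
    completed = subprocess.run(memcheck_command(binary, *args))
    if completed.returncode != 0:
        raise RuntimeError(f"Memcheck failed with exit code {completed.returncode}")
```

Note that with `--error-exitcode=1` a failure of the program itself with exit code 1 is indistinguishable from a Memcheck error; a less common code avoids that ambiguity.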

state: Blocked

Currently a few e2e tests rely on clkhash for generating the Bloom filters. While it is nice to do an integration test between the related libraries, we shouldn't introduce a clkhash...

When creating the bi-grams, the first and last bi-gram are padded with a whitespace character. This is a weakness because it allows an attacker to more easily find the beginning...
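To make the padding concrete, here is a small illustration of bi-gram tokenisation with and without it. This is not taken from the library's tokenizer; the `bigrams` function and its `pad` parameter are hypothetical.

```python
def bigrams(word, pad=True):
    """Split a word into overlapping character pairs.

    With pad=True a single space is added at each end, so the first and
    last bi-grams mark the word boundaries.
    """
    text = f" {word} " if pad else word
    return [text[i:i + 2] for i in range(len(text) - 1)]


padded = bigrams("anna")
unpadded = bigrams("anna", pad=False)
```

With padding, the tokens `" a"` and `"a "` appear only at the boundaries of a word, which is the property the issue describes as a weakness.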

research
security

For testing purposes it would be useful to have a compatible JAR of this library built to use the javallier mock context.

enhancement