
Concern with weight calculation using BLAST and entropies

Open ZackMitkin opened this issue 2 years ago • 2 comments

This library is pretty incredible, just have a bit of a concern I wanted to report.

My use case is as follows:

Take two CSVs containing customer data that should share one or more matchable fields (an identifier, for example).

customers1.csv:

id  name       random_field_1  random_field_2  random_field_3  etc...
1   google     555-333-222     ...             ...             ...
2   facebook   222-555-111     ...             ...             ...
3   microsoft  333-111-888     ...             ...             ...

customers2.csv:

identifier  customer_name         random_field_1  random_field_2  random_field_3  etc...
5           google inc            555             ...             ...             ...
10          facebook corp         111             ...             ...             ...
300         microsoft industries  555             ...             ...             ...
  1. create profiles
  2. cluster_similar_attributes
[
    {'cluster_id': 1, 'keys': ['1_name', '2_customer_name'], 'entropy': 1.4},
    {'cluster_id': 2, 'keys': ['1_id', '2_id', '1_random_field_1', '2_random_field_1', '1_random_field_2', '2_random_field_2', (etc...)], 'entropy': 9.5}, 
]
  3. create_block_clusters
[
    {'block_id': 0, 'profiles': [{0}, {0}], 'entropy': 1.4, 'cluster_id': -1, 'blocking_key': ''},
    {'block_id': 1, 'profiles': [{1,2}, {1}], 'entropy': 9.5, 'cluster_id': -1, 'blocking_key': ''}
]
  4. block purging
  5. block filtering
  6. WNP: here I get a few mismatches because the weight of the matches for cluster_id 2 is greater than for cluster_id 1. Assuming there are 100 rows in each file, with 100 as the separator id (200 profiles total), the output edges look something like:
[[0, 100, 10.5],
 [1, 101, 10.5],
 [2, 101, 20.8]]
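
For clarity, this is how I read those edges back into row indices. It is just a rough sketch, not sparker code; it only assumes that the second dataset's profile ids start at the separator id:

```python
# Rough sketch (not sparker code): map global profile ids back to the two CSVs.
# Assumes dataset 1 got ids 0..99 and dataset 2 got ids 100..199 (separator id = 100).
SEPARATOR_ID = 100

edges = [
    [0, 100, 10.5],
    [1, 101, 10.5],
    [2, 101, 20.8],
]

for left, right, weight in edges:
    row1 = left                   # row index in customers1.csv
    row2 = right - SEPARATOR_ID   # row index in customers2.csv
    print(f"customers1[{row1}] <-> customers2[{row2}]  weight={weight}")
```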

You will notice that the higher weight goes to the match from the cluster with the higher entropy. This doesn't seem correct to me, since the lower-entropy cluster should give the higher weight.

Using the library as-is I was able to get around 80-90 perfect matches. Once I changed the calc_weights function in common_node_pruning.py from calc_chi_square(...) * entropies[neighbor_id] to calc_chi_square(...) / entropies[neighbor_id], I got 100 perfect 1-to-1 matches.
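
To make the change concrete, here is a standalone sketch of the two variants. It is not the library's actual code, and the chi-square value is made up; only the entropies come from my example above:

```python
# Standalone sketch (not sparker's actual code) of the two weighting variants.
def edge_weight(chi2, cluster_entropy, divide_by_entropy=False):
    # chi2 stands in for whatever calc_chi_square(...) returns for this edge.
    # Library behaviour: chi2 * entropy -> high-entropy clusters boost the edge.
    # My edit:           chi2 / entropy -> high-entropy clusters dampen the edge.
    if divide_by_entropy:
        return chi2 / cluster_entropy
    return chi2 * cluster_entropy

# Made-up chi2 value, combined with the two cluster entropies from the example above.
print(edge_weight(7.5, 1.4))                          # name cluster, original formula -> 10.5
print(edge_weight(7.5, 9.5))                          # junk cluster, original formula -> 71.25
print(edge_weight(7.5, 9.5, divide_by_entropy=True))  # junk cluster, my edit -> ~0.79
```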

Does the division instead of the multiplication make sense here, and is my assumption that lower entropy should mean a stronger match correct?

Please let me know :)

ZackMitkin, Jul 12 '22 15:07

Hi, thanks for using our library and for sharing your ideas; every hint is welcome. For many datasets we noticed that a higher number of distinct values gives better blocking, and this is what entropy captures. But it may well be that for other datasets this is not the case.

The formula comes from the BLAST paper (http://www.vldb.org/pvldb/vol9/p1173-simonini.pdf). The idea is that finding something equal (e.g., with token blocking, a token shared by two records) in a cluster with high entropy (i.e. many different tokens) is more meaningful than finding a correspondence in a cluster with low entropy (i.e. all the tokens are equal). For this reason, the entropy of the cluster multiplies the weight of the edge.
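
To make the intuition concrete, here is a small illustrative sketch (not the exact implementation): the entropy of a cluster can be seen as the Shannon entropy of the values it contains, so a cluster full of distinct tokens gets a large weight and a cluster of repeated values a small one.

```python
import math
from collections import Counter

# Illustrative sketch, not sparker's implementation: Shannon entropy of the
# values seen in an attribute cluster, which then scales the chi-square weight
# of every edge whose shared token comes from that cluster.
def shannon_entropy(values):
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

junk_cluster = [f"tok{i}" for i in range(500)]            # almost all tokens distinct
name_cluster = ["google", "facebook", "microsoft"] * 100  # few distinct values

print(shannon_entropy(junk_cluster))  # ~8.97 -> high entropy, shared tokens count more
print(shannon_entropy(name_cluster))  # ~1.58 -> low entropy, shared tokens count less
```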

Maybe you can also try fine-tuning BLAST with the chi2divider parameter: increasing it smooths the pruning.
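
To give an idea of what that parameter does, here is a simplified sketch (a sketch only, not the exact code; it assumes chi2divider acts as the divisor of each profile's maximum weight when setting the local pruning threshold):

```python
# Simplified sketch (assumption, not the exact code): each profile keeps only the
# neighbours whose weight reaches a fraction of its own best weight, and a larger
# chi2divider lowers that threshold, i.e. the pruning becomes smoother.
def prune_neighbours(weighted_edges, chi2divider=2.0):
    """weighted_edges: list of (neighbour_id, weight) for one profile."""
    if not weighted_edges:
        return []
    max_weight = max(w for _, w in weighted_edges)
    threshold = max_weight / chi2divider
    return [(n, w) for n, w in weighted_edges if w >= threshold]

edges = [(100, 10.5), (101, 3.2), (102, 1.1)]
print(prune_neighbours(edges, chi2divider=2.0))  # stricter: keeps only the best edge
print(prune_neighbours(edges, chi2divider=4.0))  # smoother: keeps more edges
```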

Gaglia88, Jul 12 '22 15:07

Thanks for helping out,

I understand the logic that a match in a high-entropy cluster is more meaningful than one in a low-entropy cluster, and I do believe that makes sense for real-world entities.

For my test dataset, the only matchable field is the customer name, surrounded by a lot of junk fields; I'm just trying to test the worst-case scenario.

The issue is that I only get one low-entropy cluster, containing the customer_name and name fields. All the other fields are bunched into the default cluster, which has high entropy, so the library assigns a higher weight to matches on junk fields that shouldn't be relevant.

An example of a matched token is the last four digits of a phone number matching a segment of a UUID: phone_number: 123-3455-1234, uuid: 4ed5a34a-2d27-4e55-1234-a6785cb6c820, shared token: 1234.
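
To make that failure mode concrete, a tiny sketch of how token blocking would put those two values in the same block:

```python
import re

# Illustrative sketch: splitting values into tokens the way token blocking would,
# a phone number and a UUID end up sharing the blocking key "1234".
phone = "123-3455-1234"
uuid = "4ed5a34a-2d27-4e55-1234-a6785cb6c820"

tokens_phone = set(re.split(r"[^0-9a-zA-Z]+", phone))
tokens_uuid = set(re.split(r"[^0-9a-zA-Z]+", uuid))

print(tokens_phone & tokens_uuid)  # {'1234'} -> the two records land in the same block
```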

Is this just a drawback of using this approach? Maybe the edge case is unrealistic for real-world data?

I've tried tuning chi2divider, but that just seems to increase the number of returned edges; when I then select by the highest weight I'm still left with inaccurate results.

My customer names should be a perfect 1-to-1 match between the two files. Maybe the clustering step should ignore completely irrelevant fields if there are way too few matches? What do you think?
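
For reference, this is roughly the post-processing I do today to force a 1-to-1 assignment (plain Python, not a sparker feature), and it shows exactly where the junk-field edge wins:

```python
# Rough sketch of what I do today (not a sparker feature): greedily keep the
# highest-weight edge per profile so each record is matched at most once.
def greedy_one_to_one(edges):
    """edges: list of (left_id, right_id, weight); returns a 1-to-1 subset."""
    matched_left, matched_right, kept = set(), set(), []
    for left, right, weight in sorted(edges, key=lambda e: -e[2]):
        if left not in matched_left and right not in matched_right:
            matched_left.add(left)
            matched_right.add(right)
            kept.append((left, right, weight))
    return kept

print(greedy_one_to_one([(0, 100, 10.5), (1, 101, 10.5), (2, 101, 20.8)]))
# -> [(2, 101, 20.8), (0, 100, 10.5)]: profile 1 loses its match because the
#    junk-field edge (2, 101) outweighs the name edge, which is exactly the problem.
```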

ZackMitkin, Jul 12 '22 16:07