sparker
Concern with weight calculation using BLAST and entropies
This library is pretty incredible; I just have a bit of a concern I wanted to report.
My use case is as follows:
Take 2 CSVs containing customer data, each of which should contain 1 or more matchable fields (an identifier, for example):
customers1.csv:
id | name | random_field_1 | random_field_2 | random_field_3 | etc... |
---|---|---|---|---|---|
1 | 555-333-222 | ... | ... | ... | |
2 | 222-555-111 | ... | ... | ... | |
3 | microsoft | 333-111-888 | ... | ... | ... |
customers2.csv:
identifier | customer_name | random_field_1 | random_field_2 | random_field_3 | etc... |
---|---|---|---|---|---|
5 | google inc | 555 | ... | ... | ... |
10 | facebook corp | 111 | ... | ... | ... |
300 | microsoft industries | 555 | ... | ... | ... |
- create profiles
- cluster_similar_attributes
[
{'cluster_id': 1, 'keys': ['1_name', '2_customer_name'], 'entropy': 1.4},
{'cluster_id': 2, 'keys': ['1_id', '2_id', '1_random_field_1', '2_random_field_1', '1_random_field_2', '2_random_field_2', (etc...)], 'entropy': 9.5},
]
- create_block_clusters
[
{'block_id': 0, 'profiles': [{0}, {0}], 'entropy': 1.4, 'cluster_id': -1, 'blocking_key': ''},
{'block_id': 1, 'profiles': [{1,2}, {1}], 'entropy': 9.5, 'cluster_id': -1, 'blocking_key': ''}
]
- block purging
- block filtering
- WNP
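Before getting to the problem, here is a tiny self-contained illustration of what I understand the cluster-aware blocking step to do (plain Python on the toy data above, not sparker's actual code):

```python
# Tiny, self-contained illustration of cluster-aware token blocking.
# This is NOT sparker's implementation; record ids, attribute names and the
# tokenizer are made up to mirror the toy CSVs above.
from collections import defaultdict

# attribute -> attribute cluster id, mirroring the clustering output above
attribute_cluster = {"1_name": 1, "2_customer_name": 1,
                     "1_random_field_1": 2, "2_random_field_1": 2}

records = {
    ("customers1", 3): {"1_name": "microsoft", "1_random_field_1": "333-111-888"},
    ("customers2", 300): {"2_customer_name": "microsoft industries", "2_random_field_1": "555"},
}

# one block per (cluster_id, token) pair, so tokens only collide when they
# come from attributes in the same cluster
blocks = defaultdict(set)
for rec_id, attrs in records.items():
    for attr, value in attrs.items():
        for token in value.replace("-", " ").split():
            blocks[(attribute_cluster[attr], token)].add(rec_id)

# keep only blocks that actually contain more than one record
for key, members in blocks.items():
    if len(members) > 1:
        print(key, members)   # (1, 'microsoft') {('customers1', 3), ('customers2', 300)}
```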
With this setup I would get a few mis-matches, because the weight of matches for cluster_id 2 would be greater than for cluster_id 1. Assuming there are 100 rows in each file and 100 is the separator id (200 profiles total), the output edges would look something like:
[[0, 100, 10.5],
 [1, 101, 10.5],
 [2, 101, 20.8]]
You will notice that the higher weight goes to the match that has the higher entropy. This doesn't seem correct to me since lower entropy should give higher weight.
Using the standard library I was able to get around 80-90 perfect matches. Once I edited the calc_weights function in common_node_pruning.py from calc_chi_square(...) * entropies[neighbor_id] to calc_chi_square(...) / entropies[neighbor_id], I was able to get 100 perfect 1-to-1 matches.
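To make the numbers concrete, here is a toy comparison of the two formulas (plain Python, not the library's code; the chi-square and entropy values are just picked to roughly reproduce the edge weights above):

```python
# Toy comparison of the original and the modified weighting. chi2 stands in for
# the result of calc_chi_square(...); entropy is the attribute cluster's entropy.
# The values are illustrative, chosen to roughly match the example edges above.
def blast_weight(chi2, entropy):
    return chi2 * entropy        # original: a high-entropy cluster boosts the edge

def modified_weight(chi2, entropy):
    return chi2 / entropy        # my change: a high-entropy cluster dampens the edge

name_match = (7.5, 1.4)   # name <-> customer_name edge, low-entropy cluster 1
junk_match = (2.2, 9.5)   # junk-field edge, high-entropy cluster 2

print(blast_weight(*name_match), blast_weight(*junk_match))        # 10.5 20.9  -> junk edge wins
print(modified_weight(*name_match), modified_weight(*junk_match))  # ~5.36 ~0.23 -> name edge wins
```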
Does the division instead of multiplication here make sense, and is my assumption correct that lower entropy should mean a stronger match?
Please let me know :)
Hi, thanks for using our library and for sharing your ideas, every hint is welcome. For many datasets, we noticed that a higher number of distinct values gives better blocking, and this is captured by the entropy. But it may well be that for other datasets this is not the case.
The formula comes from the BLAST paper (http://www.vldb.org/pvldb/vol9/p1173-simonini.pdf). The idea is that finding something equal (i.e. if you use token blocking, a token shared by two records) in a cluster that has a high entropy (i.e. a lot of different tokens) is more meaningful than finding a correspondence in a cluster with low entropy (i.e. all the tokens are equal). For this reason, the entropy of the cluster is multiplied into the weight of the edge.
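As a quick illustration of that intuition (a toy Shannon entropy over token counts, not exactly what the library computes):

```python
# A cluster whose values are almost all identical has low entropy, so sharing a
# token there says little; a cluster with many distinct values has high entropy,
# so a shared token is more surprising.
from collections import Counter
from math import log2

def shannon_entropy(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

low_variety = ["acme"] * 95 + ["inc"] * 5          # almost constant values
high_variety = [f"tok{i}" for i in range(100)]     # all distinct values

print(shannon_entropy(low_variety))    # ~0.29
print(shannon_entropy(high_variety))   # ~6.64
```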
Maybe you can also try fine-tuning BLAST by using the chi2divider parameter; increasing it will smooth the pruning.
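In a simplified sketch (not the actual WNP code), the local threshold of a node is something like its maximum edge weight divided by chi2divider, so a larger divider keeps more edges:

```python
# Simplified view of the pruning: per node, keep edges whose weight is at least
# max(weights) / chi2divider. This is a sketch, not the library's exact logic.
def wnp_keep(edge_weights, chi2divider=2.0):
    threshold = max(edge_weights) / chi2divider
    return [w for w in edge_weights if w >= threshold]

weights = [20.8, 10.5, 3.1]
print(wnp_keep(weights, chi2divider=2.0))   # [20.8, 10.5]
print(wnp_keep(weights, chi2divider=8.0))   # [20.8, 10.5, 3.1]  -> smoother pruning
```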
Thanks for helping out,
I understand the logic of high entropy being more meaningful than low entropy, and I do believe that makes sense for real-world entities.
For my test dataset, the only matchable field is the customer name, with a lot of junk fields; I'm just trying to test the worst-case edge scenario.
The issue is that I only get one low-entropy cluster, containing the customer_name and name fields. All the other fields are bunched up into the default cluster with a high entropy, so the library assigns a higher weight to the matches on junk fields that shouldn't be relevant.
An example of a matched token is the last four digits of a phone number matching a snippet of a UUID: phone_number: 123-3455-1234, uuid: 4ed5a34a-2d27-4e55-1234-a6785cb6c820, match: 1234.
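A token-blocking style split over those two values shares exactly one token (plain Python, not the library's tokenizer):

```python
# The shared token "1234" puts these two unrelated values in the same block.
import re

def tokens(value):
    return set(re.split(r"\W+", value.lower())) - {""}

phone = "123-3455-1234"
uuid = "4ed5a34a-2d27-4e55-1234-a6785cb6c820"
print(tokens(phone) & tokens(uuid))   # {'1234'}
```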
Is this just a drawback of using this approach? Maybe the edge case is unrealistic for real-world data?
I've tried tuning chi2divider, but that seems to just increase the returned edge count. When I select by highest weight I'm left with inaccurate results.
My customer names should be a perfect 1-to-1 match between the two files. Maybe the clustering step should completely ignore irrelevant fields if there are way too few matches? What do you think?