sparker
Concern with weight calculation using BLAST and entropies
This library is pretty incredible; I just have a bit of a concern I wanted to report.
My use case is as follows:
Take 2 CSVs containing customer data, each of which should contain 1 or more matchable fields (an identifier, for example):
customers1.csv:
id | name | random_field_1 | random_field_2 | random_field_3 | etc... |
---|---|---|---|---|---|
1 | 555-333-222 | ... | ... | ... | |
2 | 222-555-111 | ... | ... | ... | |
3 | microsoft | 333-111-888 | ... | ... | ... |
customers2.csv:
identifier | customer_name | random_field_1 | random_field_2 | random_field_3 | etc... |
---|---|---|---|---|---|
5 | google inc | 555 | ... | ... | ... |
10 | facebook corp | 111 | ... | ... | ... |
300 | microsoft industries | 555 | ... | ... | ... |
- create profiles
- cluster_similar_attributes
[
{'cluster_id': 1, 'keys': ['1_name', '2_customer_name'], 'entropy': 1.4},
{'cluster_id': 2, 'keys': ['1_id', '2_id', '1_random_field_1', '2_random_field_1', '1_random_field_2', '2_random_field_2', (etc...)], 'entropy': 9.5},
]
- create_block_clusters
[
{'block_id': 0, 'profiles': [{0}, {0}], 'entropy': 1.4, 'cluster_id': -1, 'blocking_key': ''},
{'block_id': 1, 'profiles': [{1,2}, {1}], 'entropy': 9.5, 'cluster_id': -1, 'blocking_key': ''}
]
- block purging
- block filtering
- WNP
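Before getting to the problem, here is a tiny self-contained illustration of what I understand the cluster-aware blocking step to do (plain Python on the toy data above, not sparker's actual code):

```python
# Tiny, self-contained illustration of cluster-aware token blocking.
# This is NOT sparker's implementation; record ids, attribute names and the
# tokenizer are made up to mirror the toy CSVs above.
from collections import defaultdict

# attribute -> attribute cluster id, mirroring the clustering output above
attribute_cluster = {"1_name": 1, "2_customer_name": 1,
                     "1_random_field_1": 2, "2_random_field_1": 2}

records = {
    ("customers1", 3): {"1_name": "microsoft", "1_random_field_1": "333-111-888"},
    ("customers2", 300): {"2_customer_name": "microsoft industries", "2_random_field_1": "555"},
}

# one block per (cluster_id, token) pair, so tokens only collide when they
# come from attributes in the same cluster
blocks = defaultdict(set)
for rec_id, attrs in records.items():
    for attr, value in attrs.items():
        for token in value.replace("-", " ").split():
            blocks[(attribute_cluster[attr], token)].add(rec_id)

# keep only blocks that actually contain more than one record
for key, members in blocks.items():
    if len(members) > 1:
        print(key, members)   # (1, 'microsoft') {('customers1', 3), ('customers2', 300)}
```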
With this setup I would get a few mis-matches, because the weight of matches for cluster_id 2 would be greater than for cluster_id 1. Assuming there are 100 rows in each file and 100 is the separator id (200 profiles total), the output edges would look something like:
[[0, 100, 10.5],
 [1, 101, 10.5],
 [2, 101, 20.8]]
You will notice that the higher weight goes to the match that has the higher entropy. This doesn't seem correct to me since lower entropy should give higher weight.
Using the standard library I was able to get around 80-90 perfect matches. Once I edited the calc_weights function in common_node_pruning.py from calc_chi_square(...) * entropies[neighbor_id] to calc_chi_square(...) / entropies[neighbor_id], I was able to get 100 perfect 1-to-1 matches.
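To make the numbers concrete, here is a toy comparison of the two formulas (plain Python, not the library's code; the chi-square and entropy values are just picked to roughly reproduce the edge weights above):

```python
# Toy comparison of the original and the modified weighting. chi2 stands in for
# the result of calc_chi_square(...); entropy is the attribute cluster's entropy.
# The values are illustrative, chosen to roughly match the example edges above.
def blast_weight(chi2, entropy):
    return chi2 * entropy        # original: a high-entropy cluster boosts the edge

def modified_weight(chi2, entropy):
    return chi2 / entropy        # my change: a high-entropy cluster dampens the edge

name_match = (7.5, 1.4)   # name <-> customer_name edge, low-entropy cluster 1
junk_match = (2.2, 9.5)   # junk-field edge, high-entropy cluster 2

print(blast_weight(*name_match), blast_weight(*junk_match))        # 10.5 20.9  -> junk edge wins
print(modified_weight(*name_match), modified_weight(*junk_match))  # ~5.36 ~0.23 -> name edge wins
```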
Does the division instead of multiplication here make sense, and is my assumption correct that lower entropy should mean a stronger match?
Please let me know :)
Hi, thanks for using our library and for sharing your ideas, every hint is welcome. For many datasets, we noticed that a higher number of distinct values gives better blocking, and this is captured by the entropy. But it may well be that for other datasets this is not the case.
The formula comes from the BLAST paper (http://www.vldb.org/pvldb/vol9/p1173-simonini.pdf). The idea is that finding something equal (i.e. if you use token blocking, a token shared by two records) in a cluster that has a high entropy (i.e. a lot of different tokens) is more meaningful than finding a correspondence in a cluster with low entropy (i.e. all the tokens are equal). For this reason, the entropy of the cluster is multiplied into the weight of the edge.
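As a quick illustration of that intuition (a toy Shannon entropy over token counts, not exactly what the library computes):

```python
# A cluster whose values are almost all identical has low entropy, so sharing a
# token there says little; a cluster with many distinct values has high entropy,
# so a shared token is more surprising.
from collections import Counter
from math import log2

def shannon_entropy(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

low_variety = ["acme"] * 95 + ["inc"] * 5          # almost constant values
high_variety = [f"tok{i}" for i in range(100)]     # all distinct values

print(shannon_entropy(low_variety))    # ~0.29
print(shannon_entropy(high_variety))   # ~6.64
```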
Maybe you can also try fine-tuning BLAST by using the chi2divider parameter; increasing it will smooth the pruning.
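In a simplified sketch (not the actual WNP code), the local threshold of a node is something like its maximum edge weight divided by chi2divider, so a larger divider keeps more edges:

```python
# Simplified view of the pruning: per node, keep edges whose weight is at least
# max(weights) / chi2divider. This is a sketch, not the library's exact logic.
def wnp_keep(edge_weights, chi2divider=2.0):
    threshold = max(edge_weights) / chi2divider
    return [w for w in edge_weights if w >= threshold]

weights = [20.8, 10.5, 3.1]
print(wnp_keep(weights, chi2divider=2.0))   # [20.8, 10.5]
print(wnp_keep(weights, chi2divider=8.0))   # [20.8, 10.5, 3.1]  -> smoother pruning
```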
Thanks for helping out,
I understand the logic of high entropy being more meaningful than low entropy, and I do believe that makes sense for real-world entities.
For my test dataset, the only matchable field is the customer name, with a lot of junk fields; I'm just trying to test the worst-case edge scenario.
The issue is that I only get one low-entropy cluster, containing the customer_name and name fields. All the other fields are bunched up into the default cluster with a high entropy, so the library assigns a higher weight to the matches on junk fields that shouldn't be relevant.
An example of a matched token is the last four digits of a phone number matching a snippet of a UUID: phone_number: 123-3455-1234, uuid: 4ed5a34a-2d27-4e55-1234-a6785cb6c820, match: 1234.
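A token-blocking style split over those two values shares exactly one token (plain Python, not the library's tokenizer):

```python
# The shared token "1234" puts these two unrelated values in the same block.
import re

def tokens(value):
    return set(re.split(r"\W+", value.lower())) - {""}

phone = "123-3455-1234"
uuid = "4ed5a34a-2d27-4e55-1234-a6785cb6c820"
print(tokens(phone) & tokens(uuid))   # {'1234'}
```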
Is this just a drawback of using this approach? Maybe the edge case is unrealistic for real-world data?
I've tried tuning chi2divider, but that seems to just increase the returned edge count. When I select by highest weight I'm left with inaccurate results.
My customer names should be a perfect 1-to-1 match between the two files. Maybe the clustering step should completely ignore irrelevant fields if there are way too few matches? What do you think?