Bring back multiple matches now that we have SQL based blocking
When we tried this with the Kolb scheme it was not worth the complexity and proliferation of block keys #670. We shoudl try again with the sql based approach.
Hi, would the sql based approach mean that multiple matches will have the same cluster id when using many-to-one and many-to-many join constraints? Thanks
no multiple matches is just about a blocking strategy not the final matching.
to do this, we would need to the kind of virtual compounding contemplated in #856. with that done, we could do multi matching like this.
| pred_id | component_idx | component_min | pred_length | value | record_id |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 'A' | 1001 |
| 0 | 1 | 2 | 2 | 'foo' | 1001 |
| 0 | 1 | 2 | 2 | 'baz' | 1001 |
| 0 | 1 | 2 | 2 | 'bang' | 1001 |
| 0 | 0 | 1 | 2 | 'A' | 1002 |
| 0 | 1 | 2 | 2 | 'bar' | 1002 |
| 0 | 0 | 1 | 2 | 'A' | 1003 |
| 0 | 1 | 2 | 2 | 'foo' | 1003 |
| 0 | 1 | 2 | 2 | 'bang' | 1003 |
to get our blocked record pairs we would need sql that looked something like:
SELECT DISTINCT *
FROM
(SELECT a.record_id,
b.record_id
FROM
(SELECT a.record_id,
b.record_id,
pred_id,
component_idx
FROM blocking_map a
INNER JOIN blocking_map b USING (pred_id,
component_idx,
value)
WHERE a.record_id < b.record_id
GROUP BY a.record_id,
b.record_id,
pred_id,
component_idx
HAVING COUNT(*) >= component_min) AS sufficient_match
GROUP BY a.record_id,
b.record_id,
pred_id
HAVING COUNT(DISTINCT component_idx) = pred_length) AS block_match
would have to be a very good predicate to make this worthwhile