dedupe icon indicating copy to clipboard operation
dedupe copied to clipboard

Bring back multiple matches now that we have SQL based blocking

Open fgregg opened this issue 5 years ago • 3 comments

When we tried this with the Kolb scheme it was not worth the complexity and proliferation of block keys #670. We shoudl try again with the sql based approach.

fgregg avatar Sep 02 '20 13:09 fgregg

Hi, would the sql based approach mean that multiple matches will have the same cluster id when using many-to-one and many-to-many join constraints? Thanks

MatthewPavia avatar Oct 02 '20 09:10 MatthewPavia

no multiple matches is just about a blocking strategy not the final matching.

fgregg avatar Oct 02 '20 16:10 fgregg

to do this, we would need to the kind of virtual compounding contemplated in #856. with that done, we could do multi matching like this.

pred_id component_idx component_min pred_length value record_id
0 0 1 2 'A' 1001
0 1 2 2 'foo' 1001
0 1 2 2 'baz' 1001
0 1 2 2 'bang' 1001
0 0 1 2 'A' 1002
0 1 2 2 'bar' 1002
0 0 1 2 'A' 1003
0 1 2 2 'foo' 1003
0 1 2 2 'bang' 1003

to get our blocked record pairs we would need sql that looked something like:

SELECT DISTINCT *
FROM
  (SELECT a.record_id,
          b.record_id
   FROM
     (SELECT a.record_id,
             b.record_id,
             pred_id,
             component_idx
      FROM blocking_map a
      INNER JOIN blocking_map b USING (pred_id,
                                       component_idx,
                                       value)
      WHERE a.record_id < b.record_id
      GROUP BY a.record_id,
               b.record_id,
               pred_id,
               component_idx
      HAVING COUNT(*) >= component_min) AS sufficient_match
   GROUP BY a.record_id,
            b.record_id,
            pred_id
   HAVING COUNT(DISTINCT component_idx) = pred_length) AS block_match

would have to be a very good predicate to make this worthwhile

fgregg avatar Jan 20 '22 18:01 fgregg