dedupe Bring back multiple matches now that we have SQL based blocking

When we tried this with the Kolb scheme it was not worth the complexity and proliferation of block keys #670. We shoudl try again with the sql based approach.

Sep 02 '20 13:09 fgregg

Hi, would the sql based approach mean that multiple matches will have the same cluster id when using many-to-one and many-to-many join constraints? Thanks

Oct 02 '20 09:10 MatthewPavia

no multiple matches is just about a blocking strategy not the final matching.

Oct 02 '20 16:10 fgregg

to do this, we would need to the kind of virtual compounding contemplated in #856. with that done, we could do multi matching like this.

component_idx	component_min	pred_length	value	record_id
0	1	2	'A'	1001
1	2	2	'foo'	1001
1	2	2	'baz'	1001
1	2	2	'bang'	1001
0	1	2	'A'	1002
1	2	2	'bar'	1002
0	1	2	'A'	1003
1	2	2	'foo'	1003
1	2	2	'bang'	1003

to get our blocked record pairs we would need sql that looked something like:

SELECT DISTINCT *
FROM
  (SELECT a.record_id,
          b.record_id
   FROM
     (SELECT a.record_id,
             b.record_id,
             pred_id,
             component_idx
      FROM blocking_map a
      INNER JOIN blocking_map b USING (pred_id,
                                       component_idx,
                                       value)
      WHERE a.record_id < b.record_id
      GROUP BY a.record_id,
               b.record_id,
               pred_id,
               component_idx
      HAVING COUNT(*) >= component_min) AS sufficient_match
   GROUP BY a.record_id,
            b.record_id,
            pred_id
   HAVING COUNT(DISTINCT component_idx) = pred_length) AS block_match

would have to be a very good predicate to make this worthwhile

Jan 20 '22 18:01 fgregg