dedupe icon indicating copy to clipboard operation
dedupe copied to clipboard

virtual compound predicate

Open fgregg opened this issue 5 years ago • 3 comments

right now we a compound predicate creates compound blocks.

as we are moving to sql approach, we could first create block_key entries for the elemental predicates and combine them in to compound blocks in the query.

this could lead to

  1. less computational work, if compound predicates include some of the same elemental blocks
  2. smaller blocking maps, since compound predicates will produce the cartesian product of the elemental blocks. if more than one elemental block returns more than blocking_key, our current process will produce many blocking_keys
  3. might open the door for more sophisticated predicates

On the other hand, this will probably be harder for folks to implement themselves if they are using their own database.

fgregg avatar Sep 07 '20 13:09 fgregg

if we did this we would probably normalize the block table to look something like this

pred_id component_idx pred_length value record_id
0 0 2 'A' 1001
0 1 2 'foo' 1001
0 0 2 'A' 1002
0 1 2 'bar' 1002
0 0 2 'A' 1003
0 1 2 'foo' 1003

to get our blocked record pairs we would need sql that looked something like:

SELECT DISTINCT *
FROM
  (SELECT a.record_id,
          b.record_id
   FROM blocking_map a
   INNER JOIN blocking_map b USING (pred_id,
                                    component_idx,
                                    value)
   WHERE a.record_id < b.record_id
   GROUP BY a.record_id,
            b.record_id,
            pred_id
   HAVING COUNT(DISTINCT component_idx) = pred_length) AS block_match

seems very likely that the overhead of that query would wipe out any other performance benefits.

fgregg avatar Jan 20 '22 16:01 fgregg

playing around with this and the virtual compound blocking map table is often larger then the current style because we usually have compound predicates that have very low cardinality

our current method of creating block keys product of the individual components will be smaller than the virtual strategy when

  1. one of the components is empty (0, ...)
  2. when all the components except one are one (1, 1, ..., 1, x)
  3. when the components are like (2, 2, 1) or (2, 2, 1, ... 1)

the product method will be the same size as the virtual strategy when

  1. there is only one component
  2. there are two components of cardinality 2 (2, 2)

the product method will be larger when

  1. at least two elements are larger than 2
  2. at least three elements are larger than 1

it's a bit disappointing, but not quite ready to close this.

the fact that we have lots of low-cardinality predicates has been shaped by having blocking methods that made high-cardinality predicates painful. also, moving in this direction would unlock other interesting features.

fgregg avatar Sep 24 '22 13:09 fgregg

if there is atomic predicate reuse across compound predicates that could make things more attractive.

the blocking map would just be the atomic predicate keys and then there would be another table that mapped atomic predicates to compound predicates

fgregg avatar Sep 24 '22 14:09 fgregg