virtual compound predicate
right now we a compound predicate creates compound blocks.
as we are moving to sql approach, we could first create block_key entries for the elemental predicates and combine them in to compound blocks in the query.
this could lead to
- less computational work, if compound predicates include some of the same elemental blocks
- smaller blocking maps, since compound predicates will produce the cartesian product of the elemental blocks. if more than one elemental block returns more than blocking_key, our current process will produce many blocking_keys
- might open the door for more sophisticated predicates
On the other hand, this will probably be harder for folks to implement themselves if they are using their own database.
if we did this we would probably normalize the block table to look something like this
| pred_id | component_idx | pred_length | value | record_id |
|---|---|---|---|---|
| 0 | 0 | 2 | 'A' | 1001 |
| 0 | 1 | 2 | 'foo' | 1001 |
| 0 | 0 | 2 | 'A' | 1002 |
| 0 | 1 | 2 | 'bar' | 1002 |
| 0 | 0 | 2 | 'A' | 1003 |
| 0 | 1 | 2 | 'foo' | 1003 |
to get our blocked record pairs we would need sql that looked something like:
SELECT DISTINCT *
FROM
(SELECT a.record_id,
b.record_id
FROM blocking_map a
INNER JOIN blocking_map b USING (pred_id,
component_idx,
value)
WHERE a.record_id < b.record_id
GROUP BY a.record_id,
b.record_id,
pred_id
HAVING COUNT(DISTINCT component_idx) = pred_length) AS block_match
seems very likely that the overhead of that query would wipe out any other performance benefits.
playing around with this and the virtual compound blocking map table is often larger then the current style because we usually have compound predicates that have very low cardinality
our current method of creating block keys product of the individual components will be smaller than the virtual strategy when
- one of the components is empty (0, ...)
- when all the components except one are one (1, 1, ..., 1, x)
- when the components are like (2, 2, 1) or (2, 2, 1, ... 1)
the product method will be the same size as the virtual strategy when
- there is only one component
- there are two components of cardinality 2 (2, 2)
the product method will be larger when
- at least two elements are larger than 2
- at least three elements are larger than 1
it's a bit disappointing, but not quite ready to close this.
the fact that we have lots of low-cardinality predicates has been shaped by having blocking methods that made high-cardinality predicates painful. also, moving in this direction would unlock other interesting features.
if there is atomic predicate reuse across compound predicates that could make things more attractive.
the blocking map would just be the atomic predicate keys and then there would be another table that mapped atomic predicates to compound predicates