
Add number of encodings in blocking metadata

Open joyceyuu opened this issue 3 years ago • 2 comments

Before computing similarity scores and matching, we need to check that the number of encodings in the blocking data is consistent with the number of encodings in the CLK data.

Currently we either load the whole JSON file just to get the number of encodings, or use ijson to count them iteratively.

It would be better to store the count in the metadata and read only that metadata when the count is needed.
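Roughly what I have in mind, as a minimal sketch only. The layout here is an assumption for illustration (a `blocks` mapping from block key to a list of encoding indices, plus a `meta` object), and `num_encodings`, `with_encoding_count` and `counts_match` are hypothetical names, not existing blocklib API:

```python
def with_encoding_count(blocking_data):
    """Return a copy of the blocking data with the encoding count stored in 'meta'.

    Assumes (hypothetically) that blocking_data["blocks"] maps each block key
    to a list of encoding indices.
    """
    referenced = set()
    for encoding_ids in blocking_data["blocks"].values():
        referenced.update(encoding_ids)

    # Copy the metadata so the caller's dict is not modified.
    meta = dict(blocking_data.get("meta", {}))
    meta["num_encodings"] = len(referenced)  # hypothetical metadata key
    return {**blocking_data, "meta": meta}


def counts_match(meta, num_clk_encodings):
    """Check the stored count against the number of CLK encodings."""
    return meta.get("num_encodings") == num_clk_encodings
```

With the count written once when the blocks are generated, the consistency check only needs the `meta` object rather than a pass over every block.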

joyceyuu, Jul 21 '20 00:07

Anonlink client already does this; I'm not sure any functionality needs to be added to blocklib.

cc @wilko77

hardbyte, Mar 07 '21 21:03

There are two different counts:

  • The count of encodings as produced by clkhash.
  • The count of encodings as referenced in the blocking data.

Anonlink-client writes the first count into the metadata. I think Joyce was talking about the latter one here.

Those two counts can differ, depending on the choice of blocking algorithm. Whereas some algorithms map every entity to at least one block, P-Sig does not guarantee that, because of its filtering. For these probabilistic schemes we found it useful to run a sanity check: count the entities that are part of at least one block. This gives us an understanding of how aggressively the algorithm filters big blocks. There is code somewhere in blocklib that does just that. This count allows you to compute the coverage of the blocking scheme (the percentage of entities referenced in at least one block). High coverage is a necessary condition for good linkage results: an entity that is not referenced in any block will never be matched.

As coverage is crucial for linkage success, it makes sense to expose this measure downstream.
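To make the measure concrete, here is a minimal sketch of the coverage computation. It is not the blocklib code mentioned above; `blocking_coverage` is a hypothetical helper, and it assumes the blocks are available as a mapping from block key to a list of entity indices:

```python
def blocking_coverage(blocks, num_entities):
    """Fraction of entities that are referenced in at least one block.

    blocks: mapping of block key -> iterable of entity indices (assumed layout).
    num_entities: total number of entities, e.g. the clkhash encoding count.
    """
    if num_entities <= 0:
        raise ValueError("num_entities must be positive")

    covered = set()
    for entity_ids in blocks.values():
        covered.update(entity_ids)
    return len(covered) / num_entities


# Entities 2 and 4 were filtered out of every block, so 3 of 5 are covered.
example_blocks = {"sig-a": [0, 1], "sig-b": [1, 3]}
example_coverage = blocking_coverage(example_blocks, 5)
```

The numerator is the second count from the list above and the denominator is the first, so storing both in the metadata would be enough to report coverage downstream.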

wilko77, Mar 07 '21 23:03