
Thoughts and proposal on improving the linkage workflow on the Entity Server


Currently, the analyst has to define a threshold 'k' when creating a run, which is then used to compute the similarity matrix. Bad choices for the threshold can lead to huge similarity matrices, which in turn put a lot of stress on the resources of the server. There is no value in massively over-sized similarity matrices, as they just lead to an inflation of false positive matches.

Unfortunately, a good choice for the threshold 'k' depends heavily on the provided CLKs. As the analyst has no access to the CLKs themselves, he essentially has to take a stab in the dark.

The server can aid the analyst in finding an appropriate threshold.

The table below shows an experiment on the NC voter data. There are 50000 CLKs from each party, and they contain 33122 matches. The "loading" factor is a proxy for the "sparseness" of the similarity matrix:

  size of similarity matrix = loading factor * 50000
| threshold | loading | precision | recall | accuracy |
|-----------|---------|-----------|--------|----------|
| 0.70      | 8949.32 | 0.439     | 0.996  | 0.439    |
| 0.75      | 340.53  | 0.457     | 0.962  | 0.458    |
| 0.80      | 14.44   | 0.657     | 0.935  | 0.684    |
| 0.83      | 2.83    | 0.878     | 0.961  | 0.890    |
| 0.85      | 1.01    | 0.948     | 0.969  | 0.945    |
| 0.87      | 0.71    | 0.977     | 0.944  | 0.948    |
| 0.90      | 0.56    | 0.995     | 0.823  | 0.880    |
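
For concreteness, here is a minimal sketch (not part of the Entity Service API; `scores`, `pair_is_match` and the function name are illustrative) of how the loading, precision and recall columns above could be computed from a list of scored candidate pairs:

```python
# Illustrative only: compute the loading factor, precision and recall for one
# threshold, given the similarity score of every candidate pair and a boolean
# flag marking which of those pairs are true matches.
import numpy as np

def evaluate_threshold(scores, pair_is_match, threshold, n_records=50000):
    above = scores >= threshold
    n_above = above.sum()
    loading = n_above / n_records                    # size of sim. matrix / 50000
    true_positives = np.logical_and(above, pair_is_match).sum()
    precision = true_positives / n_above if n_above else 0.0
    recall = true_positives / pair_is_match.sum()
    return loading, precision, recall
```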

And this table shows the same values computed with a bad schema:

| threshold | loading | precision | recall | accuracy |
|-----------|---------|-----------|--------|----------|
| 0.70      | 151.91  | 0.309     | 0.823  | 0.337    |
| 0.75      | 2.77    | 0.527     | 0.520  | 0.514    |
| 0.80      | 0.25    | 0.826     | 0.245  | 0.493    |
| 0.83      | 0.15    | 0.901     | 0.184  | 0.463    |
| 0.85      | 0.12    | 0.933     | 0.164  | 0.451    |
| 0.87      | 0.10    | 0.960     | 0.151  | 0.444    |
| 0.90      | 0.09    | 0.993     | 0.133  | 0.433    |

Now back to the "good" schema. If we look at the histogram of the entries in the similarity matrix, we can see an interesting little bump in the tail. That's where most of the correct matches hang out. Indeed, if we overlay the F1 score, a combined measure of precision and recall (1.0 is great, 0 is bad), we can see that if we choose the threshold such that it captures most of that shallow tail, we maximize the F1 score.

[Figure: histogram of similarity scores with F1 score overlay, good schema]
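
As a rough sketch of how such a plot could be produced from raw similarity scores (assuming `scores` and `pair_is_match` arrays as in the sketch above; nothing here is part of the server):

```python
# Sketch: histogram of similarity scores with the F1 score overlaid per threshold.
import numpy as np
import matplotlib.pyplot as plt

def plot_histogram_with_f1(scores, pair_is_match, n_true_matches):
    thresholds = np.linspace(0.5, 1.0, 101)
    f1 = []
    for t in thresholds:
        above = scores >= t
        tp = np.logical_and(above, pair_is_match).sum()
        precision = tp / above.sum() if above.sum() else 0.0
        recall = tp / n_true_matches
        f1.append(2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)

    fig, ax1 = plt.subplots()
    ax1.hist(scores, bins=100, log=True)   # the small bump in the tail holds the true matches
    ax1.set_xlabel('similarity score')
    ax1.set_ylabel('number of entries (log scale)')
    ax2 = ax1.twinx()                      # second axis for the F1 curve
    ax2.plot(thresholds, f1, color='red')
    ax2.set_ylabel('F1 score')
    plt.show()
```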

However, the same plot with the bad schema CLKs tells a different story.

[Figure: histogram of similarity scores with F1 score overlay, bad schema]

First, the F1 scores are pretty bad in general, and the maximum F1 score is shifted into the 'heavy' side of the histogram. That is because there are not many entries in the similarity matrix with high values.

In fact, if we plot the cumulative histograms (how many values in the similarity matrix lie above a given threshold) and add a line for the number of actual matches in the dataset (green), we see a big difference between the "good" and "bad" schema CLKs.

[Figure: cumulative histogram of similarity scores, good schema]

[Figure: cumulative histogram of similarity scores, bad schema]
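
The cumulative view is straightforward to sketch as well (again purely illustrative, not existing server functionality):

```python
# Sketch: cumulative histogram (number of similarity entries above each threshold),
# with a horizontal line for the known number of true matches.
import numpy as np
import matplotlib.pyplot as plt

def plot_cumulative_histogram(scores, n_true_matches):
    thresholds = np.linspace(0.5, 1.0, 101)
    counts_above = [(scores >= t).sum() for t in thresholds]
    plt.plot(thresholds, counts_above, label='entries above threshold')
    plt.axhline(n_true_matches, color='green', label='actual number of matches')
    plt.yscale('log')
    plt.xlabel('threshold')
    plt.ylabel('count')
    plt.legend()
    plt.show()
```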

Proposed changes to the ES workflow

Given the evidence above, I propose the following changes to how we do PPRL in the ES:

  • in order to protect the resources of the server, we should not store unnecessarily large similarity matrices. Something along the lines of letting the analyst define a 'loading factor', which defaults to something small-ish (maybe 10). While computing the similarity scores, we can then adjust the threshold 'k' on the fly such that we end up with a similarity matrix of the appropriate size (see the sketch after this list).

  • after we compute the similarity matrix, we should allow the analyst to see some statistics to enable him to choose a sensible threshold for the solver.

    • the histogram gives a good indication of what a good threshold for the solver should be
    • the cumulative histogram shows the analyst an upper bound of the possible matches, which he can then compare to the expected number of matches. The difference is somewhat indicative of the suitability of the CLKs for the given task.
  • move the threshold to the solver. This enables us to reuse the same similarity matrix for different solver runs, which in turn reduces resource consumption on the server.
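
To make the first point above concrete, a rough sketch of the loading-factor cap is given below. This is hypothetical code, not the current anonlink API: while the similarity scores are streamed through, only the best `loading_factor * n` entries are kept, which effectively raises the threshold 'k' on the fly.

```python
# Hypothetical sketch of the proposed loading-factor cap: retain only the
# loading_factor * n_records highest-scoring candidate pairs while streaming
# through the similarity scores.
import heapq

def top_candidates(score_stream, n_records, loading_factor=10):
    """score_stream yields (score, index_a, index_b) tuples."""
    capacity = int(loading_factor * n_records)
    best = []                        # min-heap of the retained entries
    for entry in score_stream:
        if len(best) < capacity:
            heapq.heappush(best, entry)
        elif entry > best[0]:        # better than the worst retained score
            heapq.heapreplace(best, entry)
    # the smallest retained score is the effective threshold 'k'
    return sorted(best, reverse=True)
```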

wilko77 · Jul 31 '18