Increase the title length thresholds in the exact text matcher

Open lizgzil opened this issue 5 years ago • 2 comments

Short reference titles are being matched against ordinary text in the main body of policy documents and reported as matches by the exact text matcher. Increase the title length thresholds in the exact text matcher to remove these false positives.

For example, "attention deficit hyperactivity disorder" was found in the text of several documents and identified as a match to a paper with that title.

lizgzil, Jul 18 '19 14:07

I am under the impression that 40 characters is a good limit, based on Peter's analysis. Above 40 characters there are few observed false positives, so we will get high precision, but at the expense of false negatives for titles between 20 and 40 characters, which means lower recall. I am adding @aoifespenge so she is aware, but at this point I think prioritising precision is a good thing, as stakeholders will experience fewer false positives.
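To make the trade-off concrete, here is a minimal sketch of a length-gated exact matcher. It is not the Reach implementation: the function name `exact_match_titles`, the `min_title_length` parameter, the whitespace and case normalisation, and the choice to treat the threshold as an inclusive minimum are all assumptions made for illustration.

```python
def exact_match_titles(text, titles, min_title_length=40):
    """Return the titles found verbatim in text, ignoring titles that are too short.

    Hypothetical helper: the 40-character default follows the limit suggested
    in this thread; treating it as an inclusive minimum is an assumption.
    """
    # Lower-case and collapse whitespace so line breaks in the document
    # do not prevent a match.
    normalised_text = " ".join(text.lower().split())
    matches = []
    for title in titles:
        normalised_title = " ".join(title.lower().split())
        # Short titles are skipped entirely: fewer false positives,
        # but any genuine short-title citations become false negatives.
        if len(normalised_title) < min_title_length:
            continue
        if normalised_title in normalised_text:
            matches.append(title)
    return matches
```

Raising `min_title_length` trades recall for precision, and lowering it does the reverse, so the value is best chosen against a labelled sample rather than fixed in code.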

nsorros, Jul 19 '19 15:07

This is related to, but not the same as, issue https://github.com/wellcometrust/reach/issues/449. Searching for titles in the main body of the text (the exact matcher) and searching in the parsed references will probably need slightly different title length thresholds, since restricting the search to the references section already narrows down the results. Similarly, the two types of search will have different accuracies.
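One way to keep the two thresholds separate is a small per-search-type lookup, sketched below. The names `TITLE_LENGTH_THRESHOLDS` and `passes_length_threshold` are hypothetical, and the value for `references` is a placeholder only; no threshold for the references search has been settled in this thread.

```python
# Hypothetical configuration: one minimum title length per search type.
# 40 is the limit discussed above for the main text; the "references"
# value is a placeholder assumption, not a measured or agreed number.
TITLE_LENGTH_THRESHOLDS = {
    "full_text": 40,
    "references": 20,
}


def passes_length_threshold(title, search_type, thresholds=TITLE_LENGTH_THRESHOLDS):
    """Return True if the title is long enough for the given search type."""
    return len(title.strip()) >= thresholds[search_type]
```

Keeping the values in one mapping means each threshold can be tuned independently once accuracy figures are available for each type of search.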

lizgzil, Feb 26 '20 13:02