dkpro-core icon indicating copy to clipboard operation
dkpro-core copied to clipboard

Better phrase matching logic in DictionaryAnnotator

Open reckart opened this issue 10 years ago • 0 comments

The current DictionaryAnnotator implementation checks for overlapping phrases to be
annotated, and disallows overlapping annotations produced if and only if the overlapping
phrases start at the same token. I.e., for tokens sequence 
w1 w2 w3 w4 w5

and phrase tree containing
w2 w3 w4 + w2 w3
resulting markup is
"w2 w3 w4" ("w2 w3" is dropped).

However, the phrase tree containing
w2 w3 w4 + w3 w4
resulting markup is
"w2 w3 w4" + "w3 w4".

This is in my opinion, counter intuitive. The annotator would be better off allowing
any kind of overlap, or disallowing any kind of overlap. The first option resembles
more the intuitive phrase lookup usecase, while the latter is compatible with the "chunking
style" usecase, such as a dictionary based Named Entity Annotator. For the no-overlap
case, a longest match preferred, left-first in case of ties is a straightforward matching
strategy.

What steps will reproduce the problem?
1. Sample text above, w1 ... w5
2. Phrase as above
3. Check markups produced

What is the expected output? What do you see instead?

I think the expected output should be either all overlaps marked up or none, controlled
by a configuration parameter.

What version of the product are you using? On what operating system?

Current, non-OS-specific.

Please provide any additional information below.

This can be easily implemented by creating a small class for matched phrases, implementing
Comparable (compares by token length and then begin offset -- a.k.a. longest prefered,
left first), and storing all matches within the sentence in a TreeSet. Than while traversing
the set, i) all matches can be added, or ii) the current match can be quickly checked
against previous items in the TreeSet depending on the annotator's configuration parameter:

i:        final TreeSet<Match> allMatches = getMatches(...);
//NOTE, THIS IS VERY SIMILAR TO THE CURRENT ROUTINE, JUST ALSO ALLOWING MULTIPLE MATCHES
WITH THE SAME START TOKEN

//AND THEN IN 6 LINES, HOW TO SELECT NON-OVERLAPPING MATCHES (FROM THE SORTED SET AS
DESCRIBED ABOVE)
ii:        
for (final Match match : allMatches) {
    boolean overlaps = false;
    for (final Match matched : alreadySelected)
        if (!(matched.beginIndex > match.endIndex || matched.endIndex < match.beginIndex))
            overlaps = true;
}
if (!overlaps) alreadySelected.add(match);

Original issue reported on code.google.com by [email protected] on 2014-12-18 15:51:48

reckart avatar May 12 '15 22:05 reckart