[LUCENE-2587] Highlighter fragment bug
Description (or a Jira issue link if you have one)
#3661 LUCENE-2587
The issue has a good write up of the bug.
To summarize, we start each new fragment at the end offset of the previous fragment instead of at the start offset of the new fragment's first token, which can introduce spurious un-analyzed characters into the fragment. Take the test case as an example: punctuation is analyzed out when the string is tokenized, yet the highlighted fragment containing the hit starts with a period (".").
The fix here starts new fragments at the start offset of the token that leads the new fragment. We also store the end offset of the antecedent fragment so we can use that to determine whether we can merge contiguous fragments.
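To make the offset change concrete, here is a minimal standalone sketch. It is not the Highlighter code from this PR: the `Span` type, the `fragments` method, and the fixed tokens-per-fragment rule are all hypothetical stand-ins, and the merge check is left out. It only contrasts starting a fragment at the previous fragment's end offset with starting it at the leading token's start offset.

```java
import java.util.ArrayList;
import java.util.List;

public class FragmentOffsetSketch {
    /** Hypothetical character span [start, end) into the original text. */
    public static final class Span {
        public final int start;
        public final int end;
        public Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Token offsets for "Hello world. Goodbye now" with punctuation analyzed out. */
    public static List<Span> demoTokens() {
        List<Span> tokens = new ArrayList<>();
        tokens.add(new Span(0, 5));    // Hello
        tokens.add(new Span(6, 11));   // world
        tokens.add(new Span(13, 20));  // Goodbye
        tokens.add(new Span(21, 24));  // now
        return tokens;
    }

    /**
     * Groups tokens into fragments of tokensPerFragment tokens each (an assumed,
     * simplified fragmenting rule). With startAtTokenOffset == false a new fragment
     * begins at the previous fragment's end offset (the behavior described as the
     * bug); with true it begins at the start offset of its first token.
     */
    public static List<Span> fragments(List<Span> tokens, int tokensPerFragment,
                                       boolean startAtTokenOffset) {
        List<Span> out = new ArrayList<>();
        int fragStart = 0;
        int prevEnd = 0; // end offset of the last token seen so far
        for (int i = 0; i < tokens.size(); i++) {
            Span t = tokens.get(i);
            if (i % tokensPerFragment == 0) {
                if (i > 0) {
                    out.add(new Span(fragStart, prevEnd)); // close the previous fragment
                }
                fragStart = startAtTokenOffset ? t.start : prevEnd;
            }
            prevEnd = t.end;
        }
        if (!tokens.isEmpty()) {
            out.add(new Span(fragStart, prevEnd)); // close the final fragment
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello world. Goodbye now";
        for (boolean fixed : new boolean[] {false, true}) {
            Span second = fragments(demoTokens(), 2, fixed).get(1);
            System.out.println((fixed ? "token start:  " : "previous end: ")
                + "\"" + text.substring(second.start, second.end) + "\"");
        }
    }
}
```

With the previous-end rule the second fragment picks up the un-analyzed ". " between the two sentences; with the token-start rule it begins directly at "Goodbye".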
If only we renamed "Highlighter" to "OriginalHighlighter", maybe folks wouldn't continue using this thing. Is the UnifiedHighlighter not satisfying you, and if so, why not?
@dsmiley apologies for my delay in getting back to your comment! I don't have any qualms about refactoring to deter people from using this. I took up this ticket more so to get involved with contributing to the Lucene project and found this in the backlog, and less so because I was using the Highlighter in a project.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!