[LUCENE-2587] Highlighter fragment bug

Open elliotzlin opened this issue 2 years ago • 1 comments

#3661 LUCENE-2587

The linked issue has a good write-up of the bug.

To summarize, we start new fragments at the end offset of the previous fragment instead of the start offset of the first token of the fragment, which can introduce spurious un-analyzed chars at the start of the fragment. To take the test case as an example, we analyze out punctuation when tokenizing the string. However, when highlighting the fragment containing the hit, we get a fragment that starts with a period (".").

The fix here starts new fragments at the start offset of the token that leads the new fragment. We also store the end offset of the antecedent fragment so we can use that to determine whether we can merge contiguous fragments.
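To illustrate the difference, here is a minimal, self-contained sketch. It is not the Lucene `Highlighter` code: the `Token` class, the `fragments` method, and the fixed tokens-per-fragment rule are hypothetical stand-ins chosen only to show how the choice of start offset changes the fragment text.

```java
import java.util.ArrayList;
import java.util.List;

public class FragmentOffsetsSketch {
    /** Hypothetical stand-in for an analyzed token with character offsets. */
    static final class Token {
        final int start;
        final int end;
        Token(int start, int end) { this.start = start; this.end = end; }
    }

    /**
     * Splits text into fragments of a fixed number of tokens (an assumption
     * made for brevity; the real fragmenter is pluggable).
     * If startAtLeadingToken is false, each fragment starts where the
     * previous one ended (the behavior described as the bug). If true, each
     * fragment starts at the start offset of its first token (the fix).
     */
    static List<String> fragments(String text, List<Token> tokens,
                                  int tokensPerFragment, boolean startAtLeadingToken) {
        List<String> out = new ArrayList<>();
        int prevEnd = 0; // end offset of the preceding fragment
        for (int i = 0; i < tokens.size(); i += tokensPerFragment) {
            int last = Math.min(i + tokensPerFragment, tokens.size()) - 1;
            int start = startAtLeadingToken ? tokens.get(i).start : prevEnd;
            int end = tokens.get(last).end;
            out.add(text.substring(start, end));
            prevEnd = end; // kept so a caller could still test for contiguity
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello world. Foo bar";
        // Offsets as an analyzer that drops punctuation might report them.
        List<Token> tokens = List.of(
            new Token(0, 5), new Token(6, 11), new Token(13, 16), new Token(17, 20));
        System.out.println(fragments(text, tokens, 2, false)); // second fragment is ". Foo bar"
        System.out.println(fragments(text, tokens, 2, true));  // second fragment is "Foo bar"
    }
}
```

In this sketch the end offset of the preceding fragment is still tracked separately from the new fragment's start, which mirrors the idea of keeping it around for the contiguity check when merging.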

elliotzlin avatar Aug 15 '22 23:08 elliotzlin

If only we renamed "Highlighter" to "OriginalHighlighter", maybe folks wouldn't continue using this thing. Is the UnifiedHighlighter not satisfying you, and if so, why not?

dsmiley avatar Oct 13 '22 22:10 dsmiley

@dsmiley apologies for my delay in getting back to your comment! I don't have any qualms about refactoring to deter people from using this. I took up this ticket mainly to get involved with contributing to the Lucene project after finding it in the backlog, rather than because I was using the Highlighter in a project.

elliotzlin avatar Sep 08 '23 06:09 elliotzlin

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Jan 08 '24 12:01 github-actions[bot]