lucene icon indicating copy to clipboard operation
lucene copied to clipboard

FastVectorHighlighter loses boost values in phrase expansion and nested queries

Open sunny6142 opened this issue 1 month ago • 4 comments

Description

FastVectorHighlighter incorrectly handles boost values in two scenarios:

  1. Phrase Expansion: When overlapping phrases with different boosts expand, the expanded phrase loses original boost values and defaults to 1.0
  2. Nested Boosts: Nested BoostQuery objects only preserve the outermost boost value, shouldn't we preserve max of them

Impact

  • Critical content gets weak highlighting despite high boost values
  • Search result emphasis doesn't reflect query importance
  • Breaks boost value semantics in search applications

Example

Scenario: Legal document search where "contract termination" (boost: 10.0) and "termination clause" (boost: 5.0) should highlight "contract termination clause" prominently.

Current Result: "contract termination clause" gets boost 1.0 → weak highlighting Expected Result: "contract termination clause" should get meaningful boost value → appropriate highlighting

Version and environment details

  • Lucene main branch
  • Component: lucene/highlighter (FastVectorHighlighter)
  • Likely affected code: FieldQuery.java phrase expansion logic

sunny6142 avatar Nov 20 '25 12:11 sunny6142

Scenario 1: Phrase Expansion Loses Boost

// Create overlapping boosted phrases
Query q1 = new BoostQuery(new PhraseQuery("contract", "termination"), 10.0f);
Query q2 = new BoostQuery(new PhraseQuery("termination", "clause"), 5.0f);
BooleanQuery query = new BooleanQuery.Builder()
    .add(q1, SHOULD).add(q2, SHOULD).build();

// FastVectorHighlighter processes this
FieldQuery fieldQuery = new FieldQuery(query, reader, true, true);
// Result: Expanded "contract termination clause" gets boost 1.0 instead of meaningful value

Scenario 2: Nested Boost

// Create nested boost query
PhraseQuery baseQuery = new PhraseQuery("important", "document");
Query innerBoost = new BoostQuery(baseQuery, 3.0f);
Query outerBoost = new BoostQuery(innerBoost, 2.0f);

// FastVectorHighlighter processes this
FieldQuery fieldQuery = new FieldQuery(outerBoost, reader, true, true);
// Result: Gets boost 2.0, should it get better boost

sunny6142 avatar Nov 20 '25 12:11 sunny6142

In Lucene, we're not always good at steering users to preferred functionality, away from old/legacy functionality, nor are we good about removing such old/legacy functionality. IMO this is true of the original Highlighter (which should probably be named that), and definitely the FastVectorHighlighter. Why are you using the FVH instead of the UnifiedHighlighter?

dsmiley avatar Nov 20 '25 14:11 dsmiley

@dsmiley Thanks for the guidance! I discovered this issue while working on a legacy code that uses FastVectorHighlighter, but I wasn't aware it's considered legacy.

After reviewing UnifiedHighlighter's code, it clearly handles these boost scenarios much better.

I’ll close this issue since fixing the legacy FVH isn’t worthwhile. Instead, I’d like to create a PR to deprecate FVH. Do you think this is something we should do?

/**
 * Another highlighter implementation.
 * 
 * @deprecated Use {@link org.apache.lucene.search.uhighlight.UnifiedHighlighter} instead.
 *             FastVectorHighlighter is legacy and has known issues with boost value handling
 *             and complex query support. UnifiedHighlighter provides better accuracy,
 *             performance, and is actively maintained.
 */
@Deprecated
public class FastVectorHighlighter {

sunny6142 avatar Nov 21 '25 10:11 sunny6142

Sounds good to me. I'd like to raise this (and the original Highlighter) in the dev list first to get broad input before making a big decision of this nature.

dsmiley avatar Nov 21 '25 13:11 dsmiley

Hi @dsmiley ! Thanks for the insights. I am one of the contributors to Yelp/nrtsearch. And we are using FVH in production to highlight a list of phrases with different boost values. Currently, it serves the purpose as we don't actually highlight on multi-terms nor span queries.

Uhighlighter currently doesn't seem to be a good fit for migration because:

  1. query boost value is unsupported yet
  2. to highlight a phrase as a whole unit, weightMatch must be used. But we are afraid that there would be some highlight speed penalties because of it, as we are calling highlight() per document per field.

I am happy to make contributions back to uhighlighter. But in regards to priority, we can parallelize the work and make bug fixes in FVH at the same time: PR

waziqi89 avatar Dec 18 '25 18:12 waziqi89

Makes sense. Thanks for being willing to consider improving the UnifiedHighlighter for considering boosts. RE your point #2 -- until you benchmark -- hard to say. I'm skeptical the FVH could ever beat the UH simply because the UH doesn't need to load term vectors except for when wildcard (multi-term) queries are used.

I'll review this and the other linked PR in a bit... I'm telling myself tonight but we'll see :-)

dsmiley avatar Dec 22 '25 14:12 dsmiley