lucene Blended queries with boolean rewrite can result in inconsistent scores [LUCENE-9269]

If two blended queries are should clauses of a boolean query and are built so that

some of their terms are the same
their rewrite method is BlendedTermQuery.BOOLEAN_REWRITE

the docFreq for the overlapping terms used for scoring is picked as follow:

if the overlapping terms are not boosted, the df of the term in the first blended query is used
if any of the overlapping terms is boosted, the df is picked at (what looks like) random.

A few examples using a field with 2 terms: f:a (df: 2), and f:b (df: 3).

a)
Blended(f:a f:b) Blended (f:a)
        df: 3             df: 2
gets rewritten to:
(f:a)^2.0 (f:b)
df: 3      df:2

b)
Blended(f:a) Blended(f:a f:b)
        df: 2        df: 3
gets rewritten to:
(f:a)^2.0 (f:b)
 df: 2     df:2

c)
Blended(f:a f:b^0.66) Blended (f:a^0.75)
        df: 3                  df: 2
gets rewritten to:
(f:a)^1.75 (f:b)^0.66
 df:?       df:2

with ? either 2 or 3, depending on the run.

Migrated from LUCENE-9269 by Michele Palmia (@micpalmia), updated Apr 11 2022 Attachments: LUCENE-9269-test.patch

Mar 10 '20 12:03 asfimport

Michele Palmia (@micpalmia) (migrated from JIRA)

I added a very simple test (with my very limited Lucene testing skills) that emulates example c) above and checks for the score of the top document. As there is no "right" score, I just check for one of the two possible scores and have the test fail on the other.

I'm having a hard time wrapping my head around what the right behavior should be in this case (and thus coming up with a more sensible test and fix).

In case that's useful, I should probably add that the randomness in the scoring behavior is due to the HashMap underlying MultiSet: when should clauses are processed for deduplication, they're served in an arbitrary order (see [BooleanQuery.java:370|[https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java#L370]])

Mar 10 '20 12:03 asfimport

Adrien Grand (@jpountz) (migrated from JIRA)

We should remove BlendedTermQuery eventually. It tries to solve cross-field search and synonym search at the same time, which introduces complications... Since you seem to be using it for the synonym case, you can look at SynonymQuery, which can deal with multiple synonyms that have different boosts today already. For cross-field search, we have a BM25FQuery, though I hope we'll find ways to make it easier to use in the future, e.g. by moving the scoring logic to Similarity.

Mar 10 '20 12:03 asfimport

Michele Palmia (@micpalmia) (migrated from JIRA)

I was actually just looking at a user report that came to lucene-dev and looked interesting. In their use case, they were using fuzzy queries, that in turn generate blended queries that are affected by this issue. Maybe users of BlendedQuery/FuzzyQuery should be able to find some form of warning in the docs (while #9884 is not fixed)?

Mar 10 '20 13:03 asfimport

Adrien Grand (@jpountz) (migrated from JIRA)

For this particular issue, I think that the right fix would be to fix TermQuery#equals and hashCode to take perReaderTermState into account. Queries shouldn't be considered equal if they might return different scores. I don't think that this would have bad side-effects as boolean rewrites are generally used for scoring queries, which are not cached (which is the other typicall call-site for Query#equals/hashCode).

Mar 10 '20 18:03 asfimport

Michele Palmia (@micpalmia) (migrated from JIRA)

I have a few questions, please feel free to let me know if they're too dumb:

while testing a solution for adding perReaderTermState to the current TermQuery#equals implementation, I found a test that I believe is not doing anything of what it was designed to do - essentially it was rewritten for an only tangentially related change, and it's been working as no-op since (test is TestMultiTermQueryRewrites#checkBoosts, problematic edit was this, missing essential initialSeekTerm). Should I fix it as part of my proposal for this or open a new issue?
What's your opinion on comparing two TermQueries only one of which has a perReaderTermState? I'd say the're different, but their Weights could ultimately end up using the exact same statistics.
Changing equals without changing toString mean errors like
```
expected:<foo:bar> but was:<foo:bar>
```
are possible. That seems to me less of an issue than adding df/ttf to the TermQuery representation. Is that so?

Mar 22 '20 14:03 asfimport

Michael Sokolov (@msokolov) (migrated from JIRA)

Does the checkBoosts test case you refer to fail when you attempt your change? If so, please address it. Otherwise, I think it should be fixed separately
It's OK for two different queries to behave the same, and I don't see how you can know that they will in this case, so they should compare different I think
again, toString() is not guaranteed to be different for different queries; I think it's OK

Apr 11 '22 13:04 asfimport

I recently encountered this issue when migrating from Solr 7.x to 9.x. For me it caused wrong scores for a query that consists of exact match with a boost ORed with a fuzzy match, e.g.: text:foobar^10 OR text:foobar~1 I think it should be quite a popular use-case, so I'm a little surprised this issue is not more popular. Debugging it I came to the same conclusion, that TermQuery should take into consideration its perReaderTermState field in equals and hashCode implementation. I have made a simple app that demonstrates the problem: bug-example.zip Please note that this problem may not always be visible and seem random because it depends on hash codes that depend on StringHelper.GOOD_FAST_HASH_SEED which is based on the current time. It is useful to set tests.seed system property to reproduce the problem consistently.

@micpalmia have you succeeded in implementing the fix? Is there any chance for a PR?

May 22 '23 10:05 rafalh

while testing a solution for adding perReaderTermState to the current TermQuery#equals implementation, I found a test that I believe is not doing anything of what it was designed to do - essentially it was rewritten for an only tangentially related change, and it's been working as no-op since (test is TestMultiTermQueryRewrites#checkBoosts, problematic edit was this, missing essential initialSeekTerm). Should I fix it as part of my proposal for this or open a new issue?

Yikes -- did you ever open a separate issue/PR for this missing initialSeekTerm?

Nov 02 '23 14:11 mikemccand

Same issue here with TermQuery + boost and FuzzyQuery. (text:foobar^10 OR text:foobar~1) The bug and PR seem to be stuck. Any chance we can revive this fix?

Nov 17 '25 13:11 BelovEvgeny