
LUCENE-8947: Skip field length accumulation when norms are disabled

Open dxl360 opened this issue 4 years ago • 8 comments

Description

Lucene accumulates the total field length while indexing a field. But when we use custom term frequencies to hold arbitrary scoring signals, this accumulation can hit an integer overflow if the scoring signals and the number of tokens are large enough. This PR aims to fix that issue: https://issues.apache.org/jira/browse/LUCENE-8947
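To illustrate the failure mode, here is a plain-Java simulation (hypothetical names, not actual Lucene code): each token's custom term frequency is added into an int field length, so large scoring signals silently wrap it around.

```java
// Plain-Java simulation of the overflow (hypothetical, not Lucene code):
// the field length is an int, and each token's custom term frequency is
// added into it, so large scoring signals wrap it around silently.
public class OverflowDemo {
    static int accumulate(int numTokens, int customTermFreq) {
        int length = 0;
        for (int i = 0; i < numTokens; i++) {
            length += customTermFreq; // silent int wraparound, no error raised
        }
        return length;
    }

    public static void main(String[] args) {
        // Three tokens, each carrying a 1e9 "scoring signal" as its frequency:
        System.out.println(accumulate(3, 1_000_000_000)); // -1294967296
    }
}
```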

Solution

Skip the field length accumulation when norms are disabled.
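A minimal sketch of the idea (simplified plain Java, not the actual patch; the class and method names here are hypothetical): since the accumulated length is only ever consumed by norms computation, skip the accumulation entirely when norms are omitted.

```java
// Simplified model of the proposed fix (hypothetical, not the actual patch):
// skip the length accumulation entirely when norms are omitted, because the
// accumulated length is only ever used to compute norms.
public class FieldLengthAccumulator {
    private int length = 0;
    private final boolean omitNorms;

    public FieldLengthAccumulator(boolean omitNorms) {
        this.omitNorms = omitNorms;
    }

    public void addToken(int termFreq) {
        if (omitNorms) {
            return; // norms disabled: length is unused, so don't risk overflow
        }
        length += termFreq; // existing behavior when norms are enabled
    }

    public int length() {
        return length;
    }
}
```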

Tests

The test indexes a field with an extremely large custom term frequency:

  • Successfully index the field when norms are omitted
  • Expect an integer overflow error when indexing the same field with norms enabled

Checklist

Please review the following and check all that apply:

  • [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [x] I have created a Jira issue and added the issue ID to my pull request title.
  • [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [x] I have developed this patch against the master branch.
  • [x] I have run ./gradlew check.
  • [x] I have added tests for my changes.
  • [ ] I have added documentation for the Ref Guide (for Solr changes only).

dxl360 avatar Nov 13 '20 20:11 dxl360

I'm concerned about this change: other things will overflow if you have too many term frequencies in a field. Currently term frequency is bounded by 2^31-1 within a doc, and you can only have 2^31-1 documents in the index, so stats like totalTermFreq and sumTotalTermFreq can't overflow. But with this change it would become easy to overflow those stats and break scoring, fail CheckIndex, etc.
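The arithmetic behind this bound can be checked directly (plain Java, just illustrating the comment above): the corpus-wide token count, an upper bound on sumTotalTermFreq, is at most (2^31-1)^2, which fits comfortably in a long.

```java
// Arithmetic check of the bound above: each doc holds at most 2^31-1 tokens
// and an index holds at most 2^31-1 docs, so the corpus-wide token count
// (an upper bound on sumTotalTermFreq) cannot overflow a long.
public class BoundCheck {
    static long maxCorpusTokens() {
        long maxTokensPerDoc = Integer.MAX_VALUE; // 2_147_483_647
        long maxDocs = Integer.MAX_VALUE;
        return Math.multiplyExact(maxTokensPerDoc, maxDocs); // ~4.61e18
    }

    public static void main(String[] args) {
        System.out.println(maxCorpusTokens());               // 4611686014132420609
        System.out.println(Long.MAX_VALUE);                  // 9223372036854775807
        System.out.println(maxCorpusTokens() < Long.MAX_VALUE); // true
    }
}
```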

rmuir avatar Nov 14 '20 14:11 rmuir

The original implementation accumulates the int invertState.length (number of tokens) by term frequency and will overflow if the term frequencies are too large. Can we increment length by 1 for each token when we use custom term frequencies to hold arbitrary scoring signals (i.e. norms are disabled)? That way the number of tokens is bounded by 2,147,483,647 and the long totalTermFreq/sumTotalTermFreq won't overflow.
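This suggestion could be sketched as follows (hypothetical plain Java, not Lucene code): count each token as 1 when norms are disabled, instead of adding its possibly huge frequency.

```java
// Hypothetical sketch of the suggestion above (not Lucene code): when norms
// are disabled the frequency is a scoring signal rather than a repeat count,
// so count the token itself instead of adding the signal.
public class TokenCountSketch {
    static int fieldLength(int[] termFreqs, boolean omitNorms) {
        int length = 0;
        for (int freq : termFreqs) {
            length += omitNorms ? 1 : freq; // bounded by token count when norms off
        }
        return length;
    }
}
```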

dxl360 avatar Nov 18 '20 06:11 dxl360

totalTermFreq/sumTotalTermFreq are about term frequencies, nothing to do with the norms... but this norms check is the only thing guarding against overflow. We can't just disable the check like this without causing additional problems.

rmuir avatar Nov 18 '20 15:11 rmuir

I had an offline discussion with @mikemccand. Maybe we can change the type of invertState.length from int to long, keep the current check during field length/termFreq accumulation, and safely cast the length back to int when calculating the norms. The long totalTermFreq/sumTotalTermFreq would then not be broken by invertState.length.
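A sketch of that alternative (simplified, hypothetical names, not the actual Lucene code): accumulate into a long, and make the cast back to int explicit with Math.toIntExact, so overflow surfaces as an ArithmeticException instead of a silent wraparound.

```java
// Hypothetical sketch of the long-accumulation alternative (not Lucene code).
public class LongLengthSketch {
    // Accumulate the field length as a long; per-token freqs are ints, so
    // this sum cannot realistically overflow a long.
    static int normLength(int[] termFreqs) {
        long length = 0;
        for (int freq : termFreqs) {
            length += freq;
        }
        // Math.toIntExact throws ArithmeticException if the total no longer
        // fits in an int, rather than silently truncating.
        return Math.toIntExact(length);
    }
}
```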

dxl360 avatar Nov 23 '20 20:11 dxl360

I now understand @rmuir 's concern: because today we force the sum of term freqs within a single document to fit in an int (during this invertState.length accumulation for norms), and because Lucene allows at most Integer.MAX_VALUE documents, we know that totalTermFreq (the per-term sum of term freq across all docs) cannot possibly overflow a long.

Hmm, but I think sumTotalTermFreq, which is the per-field sum of totalTermFreq across all terms in that field, could overflow long even today, in an adversarial case. And it would not be detected by Lucene...

I think it is weird to rely on norms (doc length) accumulation to prevent overflow of totalTermFreq and sumTotalTermFreq, especially if norms are in fact disabled for that field?

How about decoupling these two problems? First, let's fix the aggregation of totalTermFreq and sumTotalTermFreq to explicitly catch any overflow instead of just doing the dangerous += today: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/PushPostingsWriterBase.java#L142 and https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.java#L915. I.e. switch these accumulations to Math.addExact. This will explicitly catch long overflow for either of these stats.

Second, let's not accumulate invertState.length if norms are disabled (@dxl360 's current PR).

I think this would solve both issues well -- prevent long overflow, and allow large custom term frequencies in one document when norms are disabled.
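The first step could look like this checked accumulation (a plain-Java illustration of the Math.addExact idea, not the actual codec code):

```java
// Illustration of the Math.addExact proposal (not the actual codec code):
// sum per-term totalTermFreq values into sumTotalTermFreq with a checked add,
// which throws ArithmeticException on long overflow instead of silently
// wrapping the way += does.
public class CheckedAggregation {
    static long sumTotalTermFreq(long[] totalTermFreqs) {
        long sum = 0;
        for (long ttf : totalTermFreqs) {
            sum = Math.addExact(sum, ttf); // fail fast on long overflow
        }
        return sum;
    }
}
```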

mikemccand avatar Dec 14 '20 17:12 mikemccand

> Hmm, but I think sumTotalTermFreq, which is the per-field sum of totalTermFreq across all terms in that field, could overflow long even today, in an adversarial case. And it would not be detected by Lucene...

I don't think so. I like to think of this as the number of tokens in the corpus. Because each doc is limited to Integer.MAX_VALUE tokens and there can only be Integer.MAX_VALUE docs, sumTotalTermFreq can't overflow. And totalTermFreq is <= sumTotalTermFreq (they would be equal in the degenerate case where all your documents contain only a single word repeated many times).

rmuir avatar Dec 14 '20 18:12 rmuir

> How about decoupling these two problems? First, let's fix the aggregation of totalTermFreq and sumTotalTermFreq to explicitly catch any overflow instead of just doing the dangerous += today: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/PushPostingsWriterBase.java#L142 and https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.java#L915. I.e. switch these accumulations to Math.addExact. This will explicitly catch long overflow for either of these stats.

I don't think this is correct. You wouldn't trip this until after merge, far after you've already overflowed the values and caused broken search results (assuming you have more than one segment).

rmuir avatar Dec 14 '20 18:12 rmuir

> > Hmm, but I think sumTotalTermFreq, which is the per-field sum of totalTermFreq across all terms in that field, could overflow long even today, in an adversarial case. And it would not be detected by Lucene...
>
> I don't think so. I like to think of this as the number of tokens in the corpus. Because each doc is limited to Integer.MAX_VALUE tokens and there can only be Integer.MAX_VALUE docs, sumTotalTermFreq can't overflow. And totalTermFreq is <= sumTotalTermFreq (they would be equal in the degenerate case where all your documents contain only a single word repeated many times).

Ahh you're right ... no more than Integer.MAX_VALUE tokens in one document, OK.

> > How about decoupling these two problems? First, let's fix the aggregation of totalTermFreq and sumTotalTermFreq to explicitly catch any overflow instead of just doing the dangerous += today. I.e. switch these accumulations to Math.addExact. This will explicitly catch long overflow for either of these stats.
>
> I don't think this is correct. You wouldn't trip this until after merge, far after you've already overflowed the values and caused broken search results (assuming you have more than one segment).

Hrmph, also correct, boo.

Alright I guess there is nothing we can fix here ... applications simply must not create > Integer.MAX_VALUE term frequencies in one doc/field.

mikemccand avatar Dec 15 '20 15:12 mikemccand