lucene-solr
                        LUCENE-8947: Skip field length accumulation when norms are disabled
Description
Lucene accumulates the total field length (number of tokens) while indexing a field. When custom term frequencies are used to hold arbitrary scoring signals, this accumulation can hit an integer overflow error if the scoring signals and the number of tokens are too large. This PR aims to fix the issue tracked at https://issues.apache.org/jira/browse/LUCENE-8947
Solution
Skip the field length accumulation when norms are disabled.
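For illustration, here is a minimal plain-Java sketch (not the actual Lucene code; the variable names are hypothetical) of how accumulating large custom term frequencies into an int field length silently wraps around:

```java
public class LengthOverflowDemo {
  public static void main(String[] args) {
    // invertState.length is an int in Lucene; modeled here with a plain int.
    int length = 0;
    // A large custom term frequency used as an arbitrary scoring signal:
    int customTermFreq = Integer.MAX_VALUE / 2;
    // Three tokens with this frequency are enough to overflow the int:
    for (int token = 0; token < 3; token++) {
      length += customTermFreq; // silently wraps around, no exception
    }
    System.out.println(length); // a negative value: the accumulation overflowed
  }
}
```

Skipping this accumulation when norms are disabled avoids the wraparound, since the length is only needed to compute norms.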
Tests
The test indexes a field with an extremely large custom term frequency:
- Successfully indexes the field when norms are omitted
- Expects the integer overflow error to trigger when indexing the same field with norms enabled
Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the master branch.
- [x] I have run ./gradlew check.
- [x] I have added tests for my changes.
- [ ] I have added documentation for the Ref Guide (for Solr changes only).
I'm concerned about this change: other things will overflow if you have too many term frequencies in a field. Currently frequency is bounded by 2^32-1 within a doc, and you can only have 2^32-1 documents in the index, so stats like totalTermFreq and sumTotalTermFreq can't overflow. But with this change it would be easy to do this and break scoring, fail checkindex, etc.
The original implementation accumulates the int invertState.length (number of tokens) by the term frequency, and it will overflow if the term frequency is too large. Can we increment length by 1 for each token when we use custom term frequencies to hold arbitrary scoring signals (i.e. norms are disabled)? That way the number of tokens is bounded by 2,147,483,647 and the long totalTermFreq/sumTotalTermFreq won't overflow.
totalTermFreq/sumTotalTermFreq are about term frequencies and have nothing to do with the norms... but this norms check is the only thing guarding against overflow. We can't just disable the check like this without causing additional problems.
Had an offline discussion with @mikemccand. Maybe we can change the type of invertState.length from int to long, keep the current check on field length/term freq accumulation, and safely cast the length back to int when calculating the norms. The long totalTermFreq/sumTotalTermFreq are not expected to be broken by invertState.length.
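The safe cast described above can be done with Math.toIntExact, which throws on overflow instead of silently truncating like a plain (int) cast. A small sketch (hypothetical variable names, not the actual patch):

```java
public class SafeCastDemo {
  public static void main(String[] args) {
    // Suppose invertState.length were widened to a long:
    long length = (long) Integer.MAX_VALUE + 1; // one past what an int can hold
    try {
      // Math.toIntExact throws ArithmeticException on overflow
      // rather than quietly producing a wrong (negative) value:
      int normLength = Math.toIntExact(length);
      System.out.println(normLength);
    } catch (ArithmeticException e) {
      System.out.println("field length too large to compute norms");
    }
  }
}
```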
I now understand @rmuir's concern: because today we force the sum of term freq within a single document to fit in an int (during this invertState.length accumulation for norms), and because Lucene allows at most Integer.MAX_VALUE documents, we know that totalTermFreq (the per-term sum of term freq across all docs) cannot possibly overflow a long.
Hmm, but I think sumTotalTermFreq, which is the per-field sum of all totalTermFreq across all terms in that field, could overflow a long even today, in an adversarial case. And it would not be detected by Lucene...
I think it is weird to rely on norms (doc length) accumulation to prevent overflow of totalTermFreq and sumTotalTermFreq, especially if norms are in fact disabled for that field?
How about decoupling these two problems? First, let's fix the aggregation of totalTermFreq and sumTotalTermFreq to explicitly catch any overflow instead of just doing the dangerous += today: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/PushPostingsWriterBase.java#L142 and https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.java#L915? I.e. switch these accumulations to Math.addExact. This will explicitly catch long overflow for either of these stats.
Second, let's not accumulate invertState.length if norms are disabled (@dxl360 's current PR).
I think this would solve both issues well -- prevent long overflow, and allow large custom term frequencies in one document when norms are disabled.
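The Math.addExact switch suggested above would turn silent wraparound into an explicit exception; a standalone sketch of the difference (illustrative values, not Lucene code):

```java
public class AddExactDemo {
  public static void main(String[] args) {
    long sumTotalTermFreq = Long.MAX_VALUE - 10; // near the long limit
    long increment = 100;

    // The dangerous += silently wraps to a negative value:
    long unchecked = sumTotalTermFreq + increment;
    System.out.println(unchecked < 0); // true

    // Math.addExact detects the overflow and throws instead:
    try {
      Math.addExact(sumTotalTermFreq, increment);
    } catch (ArithmeticException e) {
      System.out.println("long overflow detected");
    }
  }
}
```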
> Hmm, but I think sumTotalTermFreq, which is the per-field sum of all totalTermFreq across all terms in that field, could overflow a long even today, in an adversarial case. And it would not be detected by Lucene...

I don't think so. I like to think of this as the "number of tokens" in the corpus. Because each doc is limited to Integer.MAX_VALUE tokens and there can only be Integer.MAX_VALUE docs, sumTotalTermFreq can't overflow, and totalTermFreq is <= sumTotalTermFreq (they would be equal in the degenerate case where all your documents contain only a single word repeated many times).
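This bound checks out with a line of arithmetic: the worst case is Integer.MAX_VALUE tokens in each of Integer.MAX_VALUE documents, and (2^31 - 1)^2 = 2^62 - 2^32 + 1 still fits comfortably in a signed long (max 2^63 - 1). A sketch:

```java
public class CorpusTokenBoundDemo {
  public static void main(String[] args) {
    long maxTokensPerDoc = Integer.MAX_VALUE; // per-doc length is forced to fit in an int
    long maxDocs = Integer.MAX_VALUE;         // Lucene allows at most this many docs
    // multiplyExact would throw if this overflowed a long; it doesn't:
    long worstCaseSumTotalTermFreq = Math.multiplyExact(maxTokensPerDoc, maxDocs);
    System.out.println(worstCaseSumTotalTermFreq < Long.MAX_VALUE); // true
  }
}
```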
> How about decoupling these two problems? First, let's fix the aggregation of totalTermFreq and sumTotalTermFreq to explicitly catch any overflow instead of just doing the dangerous += today: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/PushPostingsWriterBase.java#L142 and https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.java#L915? I.e. switch these accumulations to Math.addExact. This will explicitly catch long overflow for either of these stats.

I don't think this is correct. You wouldn't trip this until after merge, long after you've already overflowed the values and caused broken search results (assuming you have more than one segment).
> > Hmm, but I think sumTotalTermFreq, which is the per-field sum of all totalTermFreq across all terms in that field, could overflow a long even today, in an adversarial case. And it would not be detected by Lucene...
>
> I don't think so. I like to think of this as the "number of tokens" in the corpus. Because each doc is limited to Integer.MAX_VALUE tokens and there can only be Integer.MAX_VALUE docs, sumTotalTermFreq can't overflow, and totalTermFreq is <= sumTotalTermFreq (they would be equal in the degenerate case where all your documents contain only a single word repeated many times).

Ahh, you're right ... no more than Integer.MAX_VALUE tokens in one document, OK.
> > How about decoupling these two problems? First, let's fix the aggregation of totalTermFreq and sumTotalTermFreq to explicitly catch any overflow instead of just doing the dangerous += today: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/PushPostingsWriterBase.java#L142 and https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.java#L915? I.e. switch these accumulations to Math.addExact. This will explicitly catch long overflow for either of these stats.
>
> I don't think this is correct. You wouldn't trip this until after merge, long after you've already overflowed the values and caused broken search results (assuming you have more than one segment).

Hrmph, also correct, boo.
Alright, I guess there is nothing we can fix here ... applications simply must not create more than Integer.MAX_VALUE term frequencies in one doc/field.