Fix ByteBlockPool integer overflow by implementing buffer limit detection
Problem
ByteBlockPool uses 32KB buffers with an integer offset tracker (byteOffset). When more than 65,535 buffers are allocated, integer overflow occurs in the byteOffset calculation (byteOffset = bufferUpto * BYTE_BLOCK_SIZE), causing an ArithmeticException during indexing of documents with large numbers of tokens.
Root Cause
- Each buffer is 32KB (BYTE_BLOCK_SIZE = 32768)
- Maximum safe buffer count: Integer.MAX_VALUE / BYTE_BLOCK_SIZE = 65535
- Once bufferUpto reaches 65,535, advancing to the next buffer overflows: the new byteOffset would exceed Integer.MAX_VALUE, surfacing as the Math.addExact ArithmeticException in ByteBlockPool.nextBuffer seen in the stack trace below
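The arithmetic above can be checked directly. This small demo mirrors only Lucene's BYTE_BLOCK_SIZE constant; the rest is plain int math, not Lucene code:

```java
// Demonstrates the overflow arithmetic behind the bug.
public class OverflowDemo {
    static final int BYTE_BLOCK_SIZE = 1 << 15; // 32768, as in ByteBlockPool

    public static void main(String[] args) {
        // Largest buffer index whose starting offset fits in an int:
        System.out.println(Integer.MAX_VALUE / BYTE_BLOCK_SIZE); // 65535

        // Buffer index 65535 still has a representable offset...
        System.out.println(65_535 * BYTE_BLOCK_SIZE); // 2147450880

        // ...but one more buffer wraps in plain int arithmetic:
        System.out.println(65_536 * BYTE_BLOCK_SIZE); // -2147483648

        // Lucene advances byteOffset with Math.addExact, so the wrap
        // surfaces as an ArithmeticException instead of silent corruption.
        try {
            Math.addExact(65_535 * BYTE_BLOCK_SIZE, BYTE_BLOCK_SIZE);
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage()); // "integer overflow"
        }
    }
}
```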
Solution
Implement proactive DWPT flushing when the buffer count approaches the limit:
- Detection: Added isApproachingBufferLimit() method to detect when buffer count approaches the overflow threshold
- Propagation: Buffer limit status flows from ByteBlockPool → IndexingChain → DocumentsWriterPerThread → DocumentsWriterFlushControl
- Prevention: Force flush DWPT before overflow occurs, similar to existing RAM-based flushing.
Key Changes
- Added buffer limit detection in ByteBlockPool
- Integrated check into DocumentsWriterFlushControl.doAfterDocument()
- Uses a threshold of 65,000 to provide a safety margin before the actual limit of 65,535
- Maintains existing performance characteristics while preventing crashes
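The detection and flush trigger described above can be sketched as follows. Only isApproachingBufferLimit(), BYTE_BLOCK_SIZE, bufferUpto, and the 65,000 threshold come from the PR description; the rest of the structure is illustrative, not the actual Lucene code:

```java
// Sketch of the buffer-limit detection, under the assumptions above.
public class ByteBlockPoolSketch {
    public static final int BYTE_BLOCK_SIZE = 1 << 15; // 32768
    // Last buffer index whose offset fits in an int: 65535.
    public static final int MAX_BUFFER_COUNT = Integer.MAX_VALUE / BYTE_BLOCK_SIZE;
    // Flush well before the hard limit, per the PR's safety margin.
    public static final int BUFFER_LIMIT_THRESHOLD = 65_000;

    private int bufferUpto = -1; // index of the current buffer

    public void nextBuffer() {
        bufferUpto++; // the real method also allocates and advances byteOffset
    }

    public boolean isApproachingBufferLimit() {
        return bufferUpto >= BUFFER_LIMIT_THRESHOLD;
    }

    public static void main(String[] args) {
        ByteBlockPoolSketch pool = new ByteBlockPoolSketch();
        for (int i = 0; i < 65_001; i++) {
            pool.nextBuffer();
        }
        // With 65,001 buffers allocated (bufferUpto == 65000) the flag
        // trips, so a caller in the flush-control path can force a DWPT
        // flush before the real 65,535 limit is hit.
        System.out.println(pool.isApproachingBufferLimit()); // true
    }
}
```

In the PR this status is consulted per document, alongside the existing RAM-based flush decision, rather than on every buffer allocation.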
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
> When more than 65,535 buffers are allocated, integer overflow occurs in the byteOffset calculation (byteOffset = bufferUpto * BYTE_BLOCK_SIZE), causing ArithmeticException during indexing of documents with large numbers of tokens.

But this is not supported: the limits on IndexWriter are 2GB.
maybe AI-generated? The bullet-point formatting looks characteristic. Not that that is banned or anything, but it might need additional scrutiny.
Hi @rmuir @msokolov, I have yet to review this PR, but I see your point that the hard limit check should be enough, as it accounts for the byteBlockPool as well.
For context, I originally created issue https://github.com/apache/lucene/issues/15152, where an OpenSearch user encountered the ByteBlockPool overflow during recovery.
message [shard failure, reason [index id[3458764570588151359] origin[LOCAL_TRANSLOG_RECOVERY] seq#[53664468]]], failure [NotSerializableExceptionWrapper[arithmetic_exception: integer overflow]], markAsStale [true]]
NotSerializableExceptionWrapper[arithmetic_exception: integer overflow]
at java.lang.Math.addExact(Math.java:883)
at org.apache.lucene.util.ByteBlockPool.nextBuffer(ByteBlockPool.java:199)
at org.apache.lucene.index.ByteSlicePool.allocKnownSizeSlice(ByteSlicePool.java:118)
at org.apache.lucene.index.ByteSlicePool.allocSlice(ByteSlicePool.java:98)
at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:226)
at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:266)
at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86)
at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:197)
at org.apache.lucene.index.TermsHashPerField.positionStreamSlice(TermsHashPerField.java:214)
at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:202)
at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1287)
at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1183)
at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:731)
at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:609)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:263)
at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1558)
at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1516)
at org.opensearch.index.engine.InternalEngine.addStaleDocs(InternalEngine.java:1291)
at org.opensearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:1210)
at org.opensearch.index.engine.InternalEngine.index(InternalEngine.java:1011)
at org.opensearch.index.shard.IndexShard.index(IndexShard.java:1226)
I think the IndexWriterHardLimit check in FlushControl comes after DocumentsWriter.updateDocuments, so adding many documents in a single call could exceed the limit and hit this exception before the check ever runs.
- Do we need headroom in the writer limits to account for the next set of documents?
- Do we need to limit the number of docs that can be passed to this method?
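One purely hypothetical way to frame the headroom question: before processing a batch, compare a worst-case buffer estimate against the remaining room below the hard limit. The method name, the estimation strategy, and estimatedBatchBytes are all invented here for illustration and do not exist in Lucene:

```java
// Hypothetical headroom check, not Lucene code.
final class BufferHeadroom {
    static final int BYTE_BLOCK_SIZE = 1 << 15; // 32768
    static final int MAX_BUFFER_COUNT = Integer.MAX_VALUE / BYTE_BLOCK_SIZE; // 65535

    /** True if the current pool plus a worst-case estimate for the next
     *  batch of documents stays below the hard buffer limit. */
    static boolean hasHeadroom(int currentBufferCount, long estimatedBatchBytes) {
        long neededBuffers =
            (estimatedBatchBytes + BYTE_BLOCK_SIZE - 1) / BYTE_BLOCK_SIZE; // ceil
        return currentBufferCount + neededBuffers < MAX_BUFFER_COUNT;
    }

    public static void main(String[] args) {
        System.out.println(hasHeadroom(1_000, 1L << 20));  // small pool, 1 MB batch: true
        System.out.println(hasHeadroom(65_500, 1L << 30)); // near the limit: false
    }
}
```

Whether such a pre-check belongs in DocumentsWriter or in the flush control is exactly what the questions above are asking; this only illustrates the arithmetic.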
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!