
Fix ByteBlockPool integer overflow by implementing buffer limit detection

Open ashish159357 opened this issue 3 months ago • 9 comments

Problem

ByteBlockPool uses 32KB buffers with an integer offset tracker (byteOffset). When more than 65,535 buffers are allocated, integer overflow occurs in the byteOffset calculation (byteOffset = bufferUpto * BYTE_BLOCK_SIZE), causing an ArithmeticException during indexing of documents with large numbers of tokens.

Root Cause

  • Each buffer is 32KB (BYTE_BLOCK_SIZE = 32768)
  • Maximum safe buffer count: Integer.MAX_VALUE / BYTE_BLOCK_SIZE = 65535
  • Once bufferUpto reaches 65,535, advancing to the next buffer pushes byteOffset past Integer.MAX_VALUE, and the overflow-checked addition throws ArithmeticException
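
The arithmetic can be reproduced in isolation with plain Java. This is a standalone demonstration, not the actual Lucene code, though Math.addExact matches the ByteBlockPool.nextBuffer frame in the stack trace quoted later in this thread:

```java
// Standalone demonstration of the overflow described above; not Lucene code.
public class ByteBlockOverflowDemo {
    static final int BYTE_BLOCK_SIZE = 32768; // 32KB blocks

    public static void main(String[] args) {
        // With 65,535 buffers, the offset still fits in an int.
        int lastSafeOffset = 65535 * BYTE_BLOCK_SIZE;
        System.out.println(lastSafeOffset);    // 2147450880
        System.out.println(Integer.MAX_VALUE); // 2147483647

        // Advancing to the 65,536th buffer exceeds Integer.MAX_VALUE;
        // the overflow-checked addition throws.
        try {
            int next = Math.addExact(lastSafeOffset, BYTE_BLOCK_SIZE);
            System.out.println(next); // never reached
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage()); // integer overflow
        }
    }
}
```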

Solution

Implement proactive DWPT flushing when the buffer count approaches the limit:

  1. Detection: Added isApproachingBufferLimit() method to detect when buffer count approaches the overflow threshold
  2. Propagation: Buffer limit status flows from ByteBlockPool → IndexingChain → DocumentsWriterPerThread → DocumentsWriterFlushControl
  3. Prevention: Force flush DWPT before overflow occurs, similar to existing RAM-based flushing.

Key Changes

  • Added buffer limit detection in ByteBlockPool
  • Integrated check into DocumentsWriterFlushControl.doAfterDocument()
  • Uses a threshold of 65,000 to provide a safety margin before the actual limit of 65,535
  • Maintains existing performance characteristics while preventing crashes
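
A minimal sketch of the detection logic described above. The method name and the 65,000 threshold come from the PR description; the surrounding class and constant names are hypothetical, not the actual patch:

```java
// Hypothetical sketch, not the actual PR code: illustrates the
// isApproachingBufferLimit() check and the 65,000 safety threshold.
public class BufferLimitSketch {
    static final int BYTE_BLOCK_SIZE = 32768;
    // Largest buffer count whose byte offset still fits in an int:
    static final int MAX_BUFFER_COUNT = Integer.MAX_VALUE / BYTE_BLOCK_SIZE; // 65535
    // Flush early to leave headroom for documents already in flight:
    static final int BUFFER_FLUSH_THRESHOLD = 65_000;

    static boolean isApproachingBufferLimit(int bufferUpto) {
        return bufferUpto >= BUFFER_FLUSH_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(MAX_BUFFER_COUNT);                 // 65535
        System.out.println(isApproachingBufferLimit(64_999)); // false
        System.out.println(isApproachingBufferLimit(65_000)); // true
    }
}
```

In the described design, DocumentsWriterFlushControl.doAfterDocument() would consult this status (propagated up from the pool) and mark the DWPT for a forced flush, mirroring the existing RAM-based trigger.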

ashish159357 avatar Oct 12 '25 15:10 ashish159357

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

github-actions[bot] avatar Oct 12 '25 15:10 github-actions[bot]

When more than 65,535 buffers are allocated, integer overflow occurs in the byteOffset calculation (byteOffset = bufferUpto * BYTE_BLOCK_SIZE), causing ArithmeticException during indexing of documents with large numbers of tokens.

But this is not supported: the limits on IndexWriter are 2GB

rmuir avatar Oct 12 '25 15:10 rmuir

maybe AI-generated? The bullet-point formatting looks characteristic. Not that that's banned or anything, but it might warrant additional scrutiny

msokolov avatar Oct 12 '25 15:10 msokolov

Hi @rmuir @msokolov, I have yet to review this PR in detail, but I see your point: the hard limit check should be enough, as it already accounts for the ByteBlockPool.

For context, I originally created issue https://github.com/apache/lucene/issues/15152, where an OpenSearch user encountered the ByteBlockPool overflow during recovery.

 message [shard failure, reason [index id[3458764570588151359] origin[LOCAL_TRANSLOG_RECOVERY] seq#[53664468]]], failure [NotSerializableExceptionWrapper[arithmetic_exception: integer overflow]], markAsStale [true]]
NotSerializableExceptionWrapper[arithmetic_exception: integer overflow]
    at java.lang.Math.addExact(Math.java:883)
    at org.apache.lucene.util.ByteBlockPool.nextBuffer(ByteBlockPool.java:199)
    at org.apache.lucene.index.ByteSlicePool.allocKnownSizeSlice(ByteSlicePool.java:118)
    at org.apache.lucene.index.ByteSlicePool.allocSlice(ByteSlicePool.java:98)
    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:226)
    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:266)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:197)
    at org.apache.lucene.index.TermsHashPerField.positionStreamSlice(TermsHashPerField.java:214)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:202)
    at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1287)
    at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1183)
    at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:731)
    at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:609)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:263)
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1558)
    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1516)
    at org.opensearch.index.engine.InternalEngine.addStaleDocs(InternalEngine.java:1291)
    at org.opensearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:1210)
    at org.opensearch.index.engine.InternalEngine.index(InternalEngine.java:1011)
    at org.opensearch.index.shard.IndexShard.index(IndexShard.java:1226)

I think the IndexWriter hard-limit check in flush control runs only after DocumentsWriter.updateDocuments completes, so a single call that adds many documents could exceed the limit and hit this exception before the check ever fires.

  1. Do we need headroom in the writer limits to account for the next set of documents?
  2. Do we need to limit the number of docs that can be passed to this method?
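
The ordering concern can be illustrated with a toy simulation. This is hypothetical, not Lucene code: if the limit is only checked after a whole batch, one large batch can blow past it mid-flight.

```java
// Toy simulation of the ordering concern; not Lucene code.
// Assumes, for illustration, that each document allocates one 32KB buffer.
public class LateCheckSimulation {
    static final int BYTE_BLOCK_SIZE = 32768;

    public static void main(String[] args) {
        long byteOffset = 0; // long here so we can observe the runaway value
        int docsInOneBatch = 70_000;

        // Indexing the batch: offsets grow on every document...
        for (int i = 0; i < docsInOneBatch; i++) {
            byteOffset += BYTE_BLOCK_SIZE;
        }

        // ...but the (simulated) hard-limit check only runs here, after
        // the batch, by which point an int offset would have overflowed.
        System.out.println(byteOffset > Integer.MAX_VALUE); // true
    }
}
```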

bharath-techie avatar Oct 13 '25 16:10 bharath-techie

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Oct 29 '25 00:10 github-actions[bot]