
UnsupportedOperation when merging `Lucene90BlockTreeTermsWriter`

Open benwtrent opened this issue 8 months ago • 4 comments

Description

Found this in the wild. I haven't been able to replicate :(

I don't even know what it means to hit this fst.outputs.merge branch, or under what conditions it is valid/invalid. Any pointers here would be useful.

We ran into a strange postings merge error in production.

The FST compiler reaches the "merge" line when merging some segments:

https://github.com/apache/lucene/blob/4b94d97a26aabebfb301f1bb34af0b4cdc284e79/lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java#L933-L936

However, the "outputs" provided by Lucene90BlockTreeTermsWriter is ByteSequenceOutputs, which does not override merge, and thus throws an unsupported operation exception.

https://github.com/apache/lucene/blob/4b94d97a26aabebfb301f1bb34af0b4cdc284e79/lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java#L534-L548

Given this, it seems like it should be "impossible" to reach the "Outputs.merge" path when merging with the Lucene90BlockTreeTermsWriter, but somehow it did.
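To make the failure mode concrete, here is a minimal, self-contained sketch (the `Sketch*` class names are illustrative, not Lucene's actual classes) of the relevant shape: `Outputs.merge` has a default body that throws, and `ByteSequenceOutputs` never overrides it, so any code path that reaches `merge()` blows up:

```java
// Sketch of the Outputs/ByteSequenceOutputs relationship; class names are
// invented for this demo and are NOT the real Lucene classes.
abstract class SketchOutputs<T> {
    abstract T add(T prefix, T output);

    // Mirrors the default org.apache.lucene.util.fst.Outputs#merge behavior:
    // subclasses that don't override this cannot merge two outputs.
    T merge(T first, T second) {
        throw new UnsupportedOperationException();
    }
}

class SketchByteSequenceOutputs extends SketchOutputs<byte[]> {
    @Override
    byte[] add(byte[] prefix, byte[] output) {
        // Concatenate the two byte sequences.
        byte[] result = new byte[prefix.length + output.length];
        System.arraycopy(prefix, 0, result, 0, prefix.length);
        System.arraycopy(output, 0, result, prefix.length, output.length);
        return result;
    }
    // merge() intentionally NOT overridden, matching the real class.
}

public class MergeDemo {
    // Returns true if calling merge() throws, as the stack trace shows.
    static boolean mergeThrows() {
        try {
            new SketchByteSequenceOutputs().merge(new byte[] {1}, new byte[] {2});
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("merge throws: " + mergeThrows());
    }
}
```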

Any ideas on where I should look?

```
at org.apache.lucene.util.fst.Outputs.merge(Outputs.java:95) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.util.fst.FSTCompiler.add(FSTCompiler.java:936) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingBlock.append(Lucene90BlockTreeTermsWriter.java:593) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingBlock.compileIndex(Lucene90BlockTreeTermsWriter.java:562) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlocks(Lucene90BlockTreeTermsWriter.java:776) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.finish(Lucene90BlockTreeTermsWriter.java:1163) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:402) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:204) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:211) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:300) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:139) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5293) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4761) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6582) ~[lucene-core-9.11.1.jar:?]
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:660) ~[lucene-core-9.11.1.jar:?]
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:134) ~[elasticsearch-8.15.0.jar:?]
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:721) ~[lucene-core-9.11.1.jar:?]
```

### Version and environment details

Lucene 9.11.1

benwtrent avatar Apr 02 '25 20:04 benwtrent

Phew, this is a spooky exception!

I think it means that the same term was fed to the FST Builder twice in a row. The FST Builder can in general support this case: it means a single input can have multiple outputs, and the Outputs impl is supposed to be able to combine those multiple outputs into a set (internally). But you're right: in this context (BlockTree) the same term should never be added more than once, each term has a single output, and the Outputs impl does not support merging. It is indeed NOT supposed to happen!

BlockTree is confusing in how it builds up its blocks. It does it one sub-tree at a time, using intermediate FSTs to hold each sub-tree, and then regurgitating the terms from each subtree with FSTTermsEnum, adding them into a bigger FST Builder to combine multiple sub-trees into a single FST. It keeps doing this up and up the terms trie until it gets to empty string and then that FST is the terms index.

So .... somehow this regurgitation process added the same term twice in a row. This means either a given FSTTermsEnum returned the same term twice in a row, or, somehow a term was duplicated at the boundary (where one FSTTermsEnum ended from a sub-block, and next FSTTermsEnum began).
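The duplicate-term condition described above can be modeled in a stripped-down way (this is a toy model, not Lucene's FSTCompiler; the class and field names are invented): the compiler remembers the previous input, and only when an identical input arrives twice in a row does it take the merge branch.

```java
import java.util.Arrays;

// Toy model of the branch in FSTCompiler.add that led to the exception:
// only a back-to-back duplicate input reaches the merge path.
public class DuplicateTermDemo {
    private byte[] lastInput;

    // Returns "merge" when the same input repeats consecutively; in the real
    // compiler that path calls fst.outputs.merge(...), which throws for
    // ByteSequenceOutputs. Otherwise the term is appended normally.
    String add(byte[] input) {
        if (lastInput != null && Arrays.equals(lastInput, input)) {
            return "merge";
        }
        lastInput = input.clone();
        return "append";
    }

    public static void main(String[] args) {
        DuplicateTermDemo c = new DuplicateTermDemo();
        System.out.println(c.add("foo".getBytes()));    // append
        System.out.println(c.add("foo".getBytes()));    // merge: the "impossible" case
        System.out.println(c.add("foobar".getBytes())); // append
    }
}
```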

Do we know any fun details about the use case? Maybe an exotic/old JVM? Massive numbers of terms...? Or the terms are some crazy binary gene sequences or something?

mikemccand avatar Apr 02 '25 21:04 mikemccand

Thank you @mikemccand for some details!

Do we know any fun details about the use case? Maybe an exotic/old JVM? Massive numbers of terms...? Or the terms are some crazy binary gene sequences or something?

I will see what I can find.

benwtrent avatar Apr 03 '25 11:04 benwtrent

@mikemccand OK, I gathered more info:

  • Modern OpenJDK (22.0.1)
  • Modern Linux

So other system stuff doesn't seem very exotic.

However, the data being ingested might contain various pieces of Turkish Unicode. Digging around the analyzers, I didn't find any special handling, so it's all going through the StandardAnalyzer with no additional normalization.

I wonder if we are just hitting the dreaded Turkish "i" Unicode issue.
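For context, the "Turkish i" issue refers to Turkish's locale-sensitive case mapping: lower-casing 'I' under a Turkish locale yields dotless 'ı' (U+0131) rather than 'i'. Whether that is actually the trigger for this Lucene failure is speculation; the demo below only shows the casing behavior itself.

```java
import java.util.Locale;

// Demonstrates the locale-sensitive casing behind the classic "Turkish i"
// problem: the same uppercase string lower-cases differently under a
// Turkish locale versus the root locale.
public class TurkishIDemo {
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr-TR"));
    }

    public static void main(String[] args) {
        System.out.println(lowerTurkish("TITLE"));            // tıtle (dotless ı, U+0131)
        System.out.println("TITLE".toLowerCase(Locale.ROOT)); // title
    }
}
```

Note that two strings that compare equal after one locale's casing can differ after another's, which is exactly the kind of inconsistency that could plausibly produce unexpected term duplicates if casing were ever applied inconsistently.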

benwtrent avatar Apr 08 '25 15:04 benwtrent

Working more on this: we have run multiple diagnostics on the machines, and no hardware issues have surfaced.

This issue arises not only on merge; I have also seen it on flush.

```
Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: unsupported_operation_exception: null
	at org.apache.lucene.util.fst.Outputs.merge(Outputs.java:95) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.util.fst.FSTCompiler.add(FSTCompiler.java:936) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingBlock.append(Lucene90BlockTreeTermsWriter.java:593) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingBlock.compileIndex(Lucene90BlockTreeTermsWriter.java:562) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlocks(Lucene90BlockTreeTermsWriter.java:776) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.finish(Lucene90BlockTreeTermsWriter.java:1163) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:402) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:172) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:134) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.IndexingChain.flush(IndexingChain.java:333) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:445) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:496) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.DocumentsWriter.maybeFlush(DocumentsWriter.java:450) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.DocumentsWriter.preUpdate(DocumentsWriter.java:391) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:413) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1561) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1519) ~[lucene-core-9.11.1.jar:?]
```

What's even weirder, I have seen it happen during document replication, meaning the primary index seems to have accepted the doc without issue :( and it only failed on the replica.

I am still trying to get information about the field contents, but this is proving difficult.

benwtrent avatar Jun 17 '25 18:06 benwtrent