
Lucene exception while adding file: Document contains at least one immense term in field="full"

Open wizwin opened this issue 6 years ago • 15 comments

May 29, 2018 10:02:41 AM org.opensolaris.opengrok.index.IndexDatabase lambda$null$1
WARNING: ERROR addFile(): /external/icu/icu4c/source/data/coll/zh.txt
**java.lang.IllegalArgumentException: Document contains at least one immense term in field="full" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.**  The prefix of the first immense term is: '[-27, -123, -103, -27, -123, -101, -27, -123, -98, -27, -123, -99, -27, -123, -95, -27, -123, -93, -27, -105, -89, -25, -109, -87, -25, -77, -114, -28, -72, -128]...', original message: bytes can be at most 32766 in length; got 39180
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:240)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:496)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1729)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1464)
	at org.opensolaris.opengrok.index.IndexDatabase.addFile(IndexDatabase.java:732)
	at org.opensolaris.opengrok.index.IndexDatabase.lambda$null$1(IndexDatabase.java:1049)
	at java.util.stream.Collectors.lambda$groupingByConcurrent$51(Collectors.java:1070)
	at java.util.stream.ReferencePipeline.lambda$collect$1(ReferencePipeline.java:496)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.pollAndExecCC(ForkJoinPool.java:1190)
	at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1879)
	at java.util.concurrent.ForkJoinPool.awaitJoin(ForkJoinPool.java:2045)
	at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:404)
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:496)
	at org.opensolaris.opengrok.index.IndexDatabase.lambda$indexParallel$2(IndexDatabase.java:1038)
	at java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1424)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 39180
	at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:263)
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
	... 33 more
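
A note on reading the message: the "prefix of the first immense term" is printed as signed Java bytes, so it helps to decode it as UTF-8 to see what kind of text produced the oversized term. The sketch below does only that, using the 30 prefix bytes copied from the exception; the class and method names are made up for illustration. The bytes decode to 10 code points of 3 bytes each (the first is U+5159), so if the rest of the term is similar, 39180 bytes would be roughly 13060 characters with no break that the analyzer treats as a token boundary.

```java
import java.nio.charset.StandardCharsets;

public class ImmenseTermPrefix {
    // Prefix bytes copied from the exception message (Java prints bytes as signed values).
    static final byte[] PREFIX = {
        -27, -123, -103, -27, -123, -101, -27, -123, -98, -27, -123, -99,
        -27, -123, -95, -27, -123, -93, -27, -105, -89, -25, -109, -87,
        -25, -77, -114, -28, -72, -128
    };

    // Interpret the raw bytes as UTF-8 text.
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = decode(PREFIX);
        // Print code points in U+XXXX form so the result does not depend on console fonts.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        System.out.println(PREFIX.length + " bytes, "
                + text.codePointCount(0, text.length()) + " code points");
    }
}
```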

wizwin avatar May 29 '18 11:05 wizwin

Can you post the contents of the file somewhere?


vladak avatar May 29 '18 12:05 vladak

http://androidxref.com/6.0.1_r10/xref/external/icu/icu4c/source/data/coll/zh.txt

wizwin avatar May 30 '18 16:05 wizwin

This is fixed by PR #2104, which caps the maximum length of an indexed token (or else skips it entirely) while allowing other (eligible) tokens in a file to be handled.
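
For readers who want to see the general idea of "cap or skip", here is a minimal sketch in plain Java. It is not the code from PR #2104 and does not use Lucene or OpenGrok classes; `TermLengthGuard`, `capToBytes` and `filter` are hypothetical names. The only value taken from this thread is the 32766-byte limit from the exception message. Each term is measured by its UTF-8 length; an oversized term is either dropped or truncated at a code point boundary, and the remaining terms pass through unchanged.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TermLengthGuard {
    // Limit quoted in the exception message in this thread.
    static final int MAX_TERM_BYTES = 32766;

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Truncate at a code point boundary so the UTF-8 encoding fits in maxBytes.
    static String capToBytes(String term, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < term.length()) {
            int cp = term.codePointAt(i);
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + cpBytes > maxBytes) {
                break;
            }
            bytes += cpBytes;
            i += Character.charCount(cp);
        }
        return term.substring(0, i);
    }

    // Keep eligible terms; truncate or skip the oversized ones.
    static List<String> filter(List<String> terms, int maxBytes, boolean truncate) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (utf8Length(t) <= maxBytes) {
                out.add(t);
            } else if (truncate) {
                out.add(capToBytes(t, maxBytes));
            }
            // Otherwise the oversized term is skipped and the rest are still kept.
        }
        return out;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int n = 0; n < 13060; n++) {
            sb.append('\u4e00'); // a 3-byte CJK character, repeated
        }
        List<String> terms = new ArrayList<>();
        terms.add("collation");
        terms.add(sb.toString());
        for (String t : filter(terms, MAX_TERM_BYTES, true)) {
            System.out.println(t.length() + " chars, " + utf8Length(t) + " UTF-8 bytes");
        }
    }
}
```

Truncation keeps a searchable prefix of the long term, while skipping avoids indexing a partial token; which behaviour is preferable is the kind of decision the PR discussion would have to settle.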

idodeclare avatar May 31 '18 00:05 idodeclare

There is also another fix for this, but it is only enabled in the JFlex layer for a few analyzers. I guess we should enable it for the plain analyzer, too.

tarzanek avatar Jun 01 '18 08:06 tarzanek

Is this issue fixed? I tried version 1.5.12 and still get this error.

xiaopao2014 avatar Mar 04 '21 02:03 xiaopao2014

command: opengrok-indexer -J=-Djava.util.logging.config.file=/home/llbeing/opengrok/etc/logging.properties -J=-Xmx8g -a /home/llbeing/opengrok/dist/lib/opengrok.jar -- -c /usr/local/bin/ctags -s /home/llbeing/opengrok_source -d /home/llbeing/opengrok/data -H -P -S -G -W /home/llbeing/opengrok/etc/configuration.xml -U http://localhost:8080/source > ./logout.log

logFile: https://drive.google.com/file/d/171_XDJg0etm7eRDVnF2PBEzAw0EcY4Aw/view?usp=sharing

problem file: https://drive.google.com/file/d/1FlJocecYxNBmMoXF-v9T7oQgMZ83Tzx4/view?usp=sharing

xiaopao2014 avatar Mar 04 '21 02:03 xiaopao2014

Attaching the files here. valid_utf16.txt opengrok_index_fail_log.log

vladak avatar Mar 04 '21 08:03 vladak

If the file really contains UTF-16 I wonder if this conflicts with UTF-8 being used internally in the indexer.
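
One quick way to check the "really contains UTF-16" part is to look for a byte order mark at the start of the file. The sketch below is a generic check, not something taken from OpenGrok; `BomSniffer` and `detectBom` are hypothetical names, and a file without a BOM can still be UTF-16, so a `null` result proves nothing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class BomSniffer {
    // Returns the encoding implied by a leading byte order mark, or null if none is present.
    // Note: FF FE 00 00 is the UTF-32LE BOM and is reported here as UTF-16LE; this sketch ignores UTF-32.
    static String detectBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BomSniffer <file>");
            return;
        }
        byte[] buf = new byte[4];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            n = in.read(buf); // a short read is fine, only the first bytes matter
        }
        String bom = detectBom(Arrays.copyOf(buf, Math.max(n, 0)));
        System.out.println(bom == null ? "no BOM found" : "BOM indicates " + bom);
    }
}
```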

vladak avatar Mar 04 '21 08:03 vladak

But the log suggests it is a length issue. It would be acceptable to me if opengrok-indexer succeeded without these files.

Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 39180

xiaopao2014 avatar Mar 04 '21 10:03 xiaopao2014

Exactly the same issue when creating index on Android-11.0.0_r8.

Source file: external/icu/icu4c/source/data/coll/zh.txt
OpenGrok release: 1.5.11
OS: Ubuntu 16.04.7 LTS

BTW, issues #2211 and #2826 are also observed in the log.

GeoffreyLu avatar Apr 03 '21 01:04 GeoffreyLu

How can this be fixed?

hhhaiai avatar Jun 06 '22 10:06 hhhaiai

How can this be fixed?

Someone needs to resurrect the PR mentioned in https://github.com/oracle/opengrok/issues/2130#issuecomment-393358506 and get it to a state that is agreed upon.

vladak avatar Jun 06 '22 12:06 vladak

oho~~~

hhhaiai avatar Jun 08 '22 03:06 hhhaiai

Hi, I have the exact same issue trying to reindex the same Android version, running OpenGrok 1.7.2.

Has anyone found a solution?

oliver-ap avatar Jul 18 '22 20:07 oliver-ap

The solution is to settle on an agreeable fix in OpenGrok and implement it - see https://github.com/oracle/opengrok/issues/2130#issuecomment-1147370787

vladak avatar Jul 19 '22 07:07 vladak