Need a way to limit the size of files processed by the indexer (Bugzilla #19176)
Status: NEW · Severity: enhancement · Component: indexer · Reported in version: unspecified · Platform: ANY/Generic · Assigned to: Trond Norbye
On 2012-02-15 13:52:01 +0000, Vladimir Kotal wrote:
Recent reindexing with 0.11 revealed that the indexer cannot cope with larger files and just blows up (JAVA_OPTS is at its default of 2 GB):
2012-02-15 14:30:53.572+0100 INFO t15 DefaultIndexChangedListener.fileAdd: Add: /foo.cpio (PlainAnalyzer)
2012-02-15 14:31:43.178+0100 SEVERE t15 IndexDatabase$1.run: Problem updating lucene index database:
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at org.opensolaris.opengrok.analysis.plain.PlainAnalyzer.analyze(PlainAnalyzer.java:77)
    at org.opensolaris.opengrok.analysis.TextAnalyzer.analyze(TextAnalyzer.java:60)
    at org.opensolaris.opengrok.analysis.AnalyzerGuru.getDocument(AnalyzerGuru.java:262)
    at org.opensolaris.opengrok.index.IndexDatabase.addFile(IndexDatabase.java:584)
    at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:814)
    at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:787)
    at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:787)
    at org.opensolaris.opengrok.index.IndexDatabase.update(IndexDatabase.java:354)
    at org.opensolaris.opengrok.index.IndexDatabase$1.run(IndexDatabase.java:158)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
2012-02-15 14:31:43.194+0100 INFO t10 Indexer.sendToConfigHost: Send configuration to: localhost:2424
2012-02-15 14:31:44.488+0100 INFO t10 Indexer.sendToConfigHost: Configuration update routine done, check log output for errors.
$ du -sh /foo.cpio
311M    /foo.cpio
There should be an option that tells the indexer to ignore files larger than a given number of bytes (similar to the -i option for filenames).
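A minimal sketch of what such a check could look like, assuming a hypothetical maxIndexedFileSize option; the name and the skip-and-log behavior are illustrative, not an existing OpenGrok setting:

```java
import java.io.File;

/**
 * Hypothetical size filter applied before a file is handed to an analyzer.
 * A non-positive limit means "no limit".
 */
public class FileSizeFilter {
    private final long maxIndexedFileSize; // bytes; e.g. from a new CLI option

    public FileSizeFilter(long maxIndexedFileSize) {
        this.maxIndexedFileSize = maxIndexedFileSize;
    }

    /** Returns true if the file is small enough to be indexed. */
    public boolean accept(File file) {
        if (maxIndexedFileSize > 0 && file.length() > maxIndexedFileSize) {
            System.err.printf("Skipping %s: %d bytes exceeds limit of %d bytes%n",
                    file.getPath(), file.length(), maxIndexedFileSize);
            return false;
        }
        return true;
    }
}
```

Like -i, this would be checked during the directory walk, so oversized files never reach the analyzer at all.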
On 2012-02-15 13:54:37 +0000, Vladimir Kotal wrote:
Maybe there should even be some sane default, like 100 MB.
On 2012-02-16 12:26:12 +0000, Knut Anders Hatlen wrote:
The analyzers don't really need to read the entire file into memory; they could also operate on streams. The reason they do read the file into memory, I think, is to avoid reading every file twice (once to add it to the Lucene indexes and once to build the xref). I'm not sure how important this optimization is (we should run some experiments to see).
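To illustrate the trade-off being discussed, here is a sketch contrasting slurping a file into one array with processing it in fixed-size chunks (the buffer size is arbitrary):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadStrategies {
    // Memory-hungry: the whole file ends up in a single byte array,
    // which is what blows the heap on a 311 MB cpio file.
    static byte[] readWholeFile(String path) throws IOException {
        return Files.readAllBytes(Paths.get(path));
    }

    // Bounded memory: only one 8 KB buffer is live at a time. An analyzer
    // would tokenize each chunk, but the file may then have to be read
    // twice (once for the Lucene index, once for the xref).
    static long readAsStream(String path) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }
}
```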
Even 100 MB is not enough in some cases; e.g. a 48 MB XHTML file can cause the indexer to run out of heap (issue #907).
Thinking about this some more, maybe the limits should be smarter, as some analyzers might be more susceptible to bigger files, i.e. allow limits based on file type (if possible).
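A sketch of per-type limits under that idea; the extensions, the 100 MB default, and the tighter XML caps are illustrative assumptions (motivated by the XHTML case above), not measured thresholds:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PerTypeSizeLimits {
    // Fallback for file types with no specific entry (the "sane default").
    private static final long DEFAULT_LIMIT = 100L * 1024 * 1024; // 100 MB

    // Types whose analyzers exhaust memory well below the default
    // get tighter caps.
    private static final Map<String, Long> LIMITS = new HashMap<>();
    static {
        LIMITS.put("xml", 16L * 1024 * 1024);
        LIMITS.put("xhtml", 16L * 1024 * 1024);
    }

    /** Size limit in bytes for a file with the given extension. */
    static long limitFor(String extension) {
        Long limit = LIMITS.get(extension.toLowerCase(Locale.ROOT));
        return limit != null ? limit : DEFAULT_LIMIT;
    }
}
```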
What to do with files that were indexed and then grew above the threshold? Or when the threshold (assuming it will be tunable) is lowered so that previously indexed files are no longer eligible? It seems to me that the correct solution would be to delete their information from the index.
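Deleting a stale entry maps naturally onto Lucene's delete-by-term API. A sketch, assuming the file's path is stored in a field named "path" (OpenGrok's actual field name may differ):

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class StaleEntryRemover {
    /** Removes the index document(s) for a file that is now over the limit. */
    static void removeFromIndex(String indexDir, String path) throws IOException {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (FSDirectory dir = FSDirectory.open(Paths.get(indexDir));
             IndexWriter writer = new IndexWriter(dir, cfg)) {
            writer.deleteDocuments(new Term("path", path)); // drop stale entry
            writer.commit();
        }
    }
}
```

The indexer would invoke this for any file whose current size exceeds the (possibly lowered) threshold, so the index never serves results for files it would now refuse to process.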