opengrok Huge text handling

Hello,

Please consider for integration this patch to add Huge Text file handling.

Indexer and Configuration get two new settings, hugeTextThresholdBytes (default 1_000_000) and hugeTextLimitCharacters (default 5_000_000). The threshold determines when OpenGrok will override a PLAIN genre file as a hugetext DATA file instead. The character limit determines how much to read and index for hugetext (with contextless truncation); the limit may be zero.

hugeTextThresholdBytes is checked for applicable files with each run, while no state for hugeTextLimitCharacters is stored. Changing hugeTextLimitCharacters after indexing would require touching affected source code files to revise the index.

For affected gzip and bzip2 files, changes to either hugeTextThresholdBytes or hugeTextLimitCharacters would require touching affected compressed files to revise the index.

Thank you.

Apr 18 '20 04:04 idodeclare

Pull Request Test Coverage Report for Build 5464

122 of 219 (55.71%) changed or added relevant lines in 17 files are covered.
7 unchanged lines in 5 files lost coverage.
Overall coverage decreased (-0.07%) to 75.639%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
opengrok-indexer/src/main/java/org/opengrok/indexer/util/LimitedReader.java	18	19	94.74%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/StreamSource.java	0	2	0.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive/GZIPAnalyzer.java	2	4	50.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java	17	19	89.47%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive/BZip2Analyzer.java	5	8	62.5%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/Indexer.java	7	10	70.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/web/SingleResult.java	0	6	0.0%
opengrok-web/src/main/java/org/opengrok/web/PageConfig.java	3	14	21.43%
opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java	0	12	0.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/AnalyzerGuru.java	25	43	58.14%
<!--	Total:	122	219

Files with Coverage Reduction	New Missed Lines	%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/AnalyzerGuru.java	1	84.21%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java	1	60.19%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/Indexer.java	1	51.3%
opengrok-indexer/src/main/java/org/opengrok/indexer/configuration/IndexTimestamp.java	2	42.86%
opengrok-web/src/main/java/org/opengrok/web/PageConfig.java	2	39.1%
<!--	Total:	7

Totals
Change from base Build 5462:	-0.07%
Covered Lines:	41408
Relevant Lines:	54744

💛 - Coveralls

Apr 18 '20 05:04 coveralls

this actually looks good, I just quickly skimmed through

maybe only one concern, I think we should WARN instead of FINE print if a file is skipped because of limits ... (unless FINE is printed by default to log file ... but then I'd like to see those as WARN on console too ... )

Apr 20 '20 18:04 tarzanek

What happens if index is created with particular limits and then the limits are changed ?

Apr 20 '20 19:04 vladak

maybe only one concern, I think we should WARN instead of FINE print if a file is skipped because of limits ... (unless FINE is printed by default to log file ... but then I'd like to see those as WARN on console too ... )

No file is skipped. It is still included but under the HugeTextAnalyzer where the user chooses how much to analyze (as low as 0). It also gets no xref.

Apr 20 '20 21:04 idodeclare

What happens if index is created with particular limits and then the limits are changed ?

I tried to describe that above, but to clarify:

You can change hugeTextThresholdBytes and then re-index, and (as applicable) PLAIN files will be reclassified as hugetext DATA or hugetext DATA will be re-classified to PLAIN.

If you change hugeTextLimitCharacters and re-index, then nothing will happen because no state is stored related to that limit.

PLAIN files inside gzip or bzip2 are handled by HugeTextAnalyzer, but changing hugeTextThresholdBytes and re-indexing does nothing because the uncompressed size is not known when checkSettings would need to decide. (Changing hugeTextLimitCharacters and reindexing also does nothing for gzip or bzip2, for the same reason above.)

Apr 20 '20 21:04 idodeclare

sorry, I meant "trimmed" down, not skipped anyways, logs should say it loud that file x and y are now not considered fully I don't expect this output will flood the screen, but it should be a warning to get printed on console

Apr 21 '20 06:04 tarzanek

@tarzanek , that's done.

Apr 21 '20 22:04 idodeclare

PLAIN files inside gzip or bzip2 are handled by HugeTextAnalyzer, but changing hugeTextThresholdBytes and re-indexing does nothing because the uncompressed size is not known when checkSettings would need to decide.

I suppose it would be straight-forward to store a value for uncompressed size in the Document so that we could compare against hugeTextThresholdBytes in checkSettings for gzip/bzip2. Worth it?

Apr 21 '20 22:04 idodeclare

PLAIN files inside gzip or bzip2 are handled by HugeTextAnalyzer, but changing hugeTextThresholdBytes and re-indexing does nothing because the uncompressed size is not known when checkSettings would need to decide.

I suppose it would be straight-forward to store a value for uncompressed size in the Document so that we could compare against hugeTextThresholdBytes in checkSettings for gzip/bzip2. Worth it?

Oh but that would mean decompressing entirely. Probably not a good idea.

Apr 21 '20 22:04 idodeclare

lgtm, so +1 from my side but since this PR is big, I'd like to get at least one other review to be merged @vladak , @tulinkry ?

May 26 '20 08:05 tarzanek

Just rebased on master since this needed revision to accommodate the Configuration API changes of #3127 (but would not have shown up as having any Git conflicts)

May 27 '20 14:05 idodeclare

I will take a look; also needs rebase.

Aug 20 '20 14:08 vladak

Just trivial conflicts upon rebase

Aug 20 '20 17:08 idodeclare

Just rebasing for trivial conflicts related to R analyzer and then again after parallel detection merged

Oct 06 '20 17:10 idodeclare

Rebased for trivial conflict in search.jsp

Oct 07 '20 18:10 idodeclare

Rebased for PageConfig.java re-lo, and git automatic-merge took care of it

Oct 09 '20 15:10 idodeclare

opengrok opengrok copied to clipboard

Huge text handling

Pull Request Test Coverage Report for Build 5464

💛 - Coveralls

opengrok
opengrok copied to clipboard