opengrok icon indicating copy to clipboard operation
opengrok copied to clipboard

Huge text handling

Open idodeclare opened this issue 4 years ago • 16 comments

Hello,

Please consider for integration this patch to add Huge Text file handling.

Indexer and Configuration get two new settings, hugeTextThresholdBytes (default 1_000_000) and hugeTextLimitCharacters (default 5_000_000). The threshold determines when OpenGrok will override a PLAIN genre file as a hugetext DATA file instead. The character limit determines how much to read and index for hugetext (with contextless truncation); the limit may be zero.

hugeTextThresholdBytes is checked for applicable files with each run, while no state for hugeTextLimitCharacters is stored. Changing hugeTextLimitCharacters after indexing would require touching affected source code files to revise the index.

For affected gzip and bzip2 files, changes to either hugeTextThresholdBytes or hugeTextLimitCharacters would require touching affected compressed files to revise the index.

Thank you.

idodeclare avatar Apr 18 '20 04:04 idodeclare

Pull Request Test Coverage Report for Build 5464

  • 122 of 219 (55.71%) changed or added relevant lines in 17 files are covered.
  • 7 unchanged lines in 5 files lost coverage.
  • Overall coverage decreased (-0.07%) to 75.639%

Changes Missing Coverage Covered Lines Changed/Added Lines %
opengrok-indexer/src/main/java/org/opengrok/indexer/util/LimitedReader.java 18 19 94.74%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/StreamSource.java 0 2 0.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive/GZIPAnalyzer.java 2 4 50.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java 17 19 89.47%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive/BZip2Analyzer.java 5 8 62.5%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/Indexer.java 7 10 70.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/web/SingleResult.java 0 6 0.0%
opengrok-web/src/main/java/org/opengrok/web/PageConfig.java 3 14 21.43%
opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java 0 12 0.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/AnalyzerGuru.java 25 43 58.14%
<!-- Total: 122 219
Files with Coverage Reduction New Missed Lines %
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/AnalyzerGuru.java 1 84.21%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java 1 60.19%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/Indexer.java 1 51.3%
opengrok-indexer/src/main/java/org/opengrok/indexer/configuration/IndexTimestamp.java 2 42.86%
opengrok-web/src/main/java/org/opengrok/web/PageConfig.java 2 39.1%
<!-- Total: 7
Totals Coverage Status
Change from base Build 5462: -0.07%
Covered Lines: 41408
Relevant Lines: 54744

💛 - Coveralls

coveralls avatar Apr 18 '20 05:04 coveralls

this actually looks good, I just quickly skimmed through

maybe only one concern, I think we should WARN instead of FINE print if a file is skipped because of limits ... (unless FINE is printed by default to log file ... but then I'd like to see those as WARN on console too ... )

tarzanek avatar Apr 20 '20 18:04 tarzanek

What happens if index is created with particular limits and then the limits are changed ?

vladak avatar Apr 20 '20 19:04 vladak

maybe only one concern, I think we should WARN instead of FINE print if a file is skipped because of limits ... (unless FINE is printed by default to log file ... but then I'd like to see those as WARN on console too ... )

No file is skipped. It is still included but under the HugeTextAnalyzer where the user chooses how much to analyze (as low as 0). It also gets no xref.

idodeclare avatar Apr 20 '20 21:04 idodeclare

What happens if index is created with particular limits and then the limits are changed ?

I tried to describe that above, but to clarify:

You can change hugeTextThresholdBytes and then re-index, and (as applicable) PLAIN files will be reclassified as hugetext DATA or hugetext DATA will be re-classified to PLAIN.

If you change hugeTextLimitCharacters and re-index, then nothing will happen because no state is stored related to that limit.

PLAIN files inside gzip or bzip2 are handled by HugeTextAnalyzer, but changing hugeTextThresholdBytes and re-indexing does nothing because the uncompressed size is not known when checkSettings would need to decide. (Changing hugeTextLimitCharacters and reindexing also does nothing for gzip or bzip2, for the same reason above.)

idodeclare avatar Apr 20 '20 21:04 idodeclare

sorry, I meant "trimmed" down, not skipped anyways, logs should say it loud that file x and y are now not considered fully I don't expect this output will flood the screen, but it should be a warning to get printed on console

tarzanek avatar Apr 21 '20 06:04 tarzanek

@tarzanek , that's done.

idodeclare avatar Apr 21 '20 22:04 idodeclare

PLAIN files inside gzip or bzip2 are handled by HugeTextAnalyzer, but changing hugeTextThresholdBytes and re-indexing does nothing because the uncompressed size is not known when checkSettings would need to decide.

I suppose it would be straight-forward to store a value for uncompressed size in the Document so that we could compare against hugeTextThresholdBytes in checkSettings for gzip/bzip2. Worth it?

idodeclare avatar Apr 21 '20 22:04 idodeclare

PLAIN files inside gzip or bzip2 are handled by HugeTextAnalyzer, but changing hugeTextThresholdBytes and re-indexing does nothing because the uncompressed size is not known when checkSettings would need to decide.

I suppose it would be straight-forward to store a value for uncompressed size in the Document so that we could compare against hugeTextThresholdBytes in checkSettings for gzip/bzip2. Worth it?

Oh but that would mean decompressing entirely. Probably not a good idea.

idodeclare avatar Apr 21 '20 22:04 idodeclare

lgtm, so +1 from my side but since this PR is big, I'd like to get at least one other review to be merged @vladak , @tulinkry ?

tarzanek avatar May 26 '20 08:05 tarzanek

Just rebased on master since this needed revision to accommodate the Configuration API changes of #3127 (but would not have shown up as having any Git conflicts)

idodeclare avatar May 27 '20 14:05 idodeclare

I will take a look; also needs rebase.

vladak avatar Aug 20 '20 14:08 vladak

Just trivial conflicts upon rebase

idodeclare avatar Aug 20 '20 17:08 idodeclare

Just rebasing for trivial conflicts related to R analyzer and then again after parallel detection merged

idodeclare avatar Oct 06 '20 17:10 idodeclare

Rebased for trivial conflict in search.jsp

idodeclare avatar Oct 07 '20 18:10 idodeclare

Rebased for PageConfig.java re-lo, and git automatic-merge took care of it

idodeclare avatar Oct 09 '20 15:10 idodeclare