opengrok
opengrok copied to clipboard
Huge text handling
Hello,
Please consider for integration this patch to add Huge Text file handling.
Indexer
and Configuration
get two new settings, hugeTextThresholdBytes
(default 1_000_000) and hugeTextLimitCharacters
(default 5_000_000). The threshold determines when OpenGrok will override a PLAIN
genre file as a hugetext
DATA
file instead. The character limit determines how much to read and index for hugetext
(with contextless truncation); the limit may be zero.
hugeTextThresholdBytes
is checked for applicable files with each run, while no state for hugeTextLimitCharacters
is stored. Changing hugeTextLimitCharacters
after indexing would require touching affected source code files to revise the index.
For affected gzip and bzip2 files, changes to either hugeTextThresholdBytes
or hugeTextLimitCharacters
would require touching affected compressed files to revise the index.
Thank you.
Pull Request Test Coverage Report for Build 5464
- 122 of 219 (55.71%) changed or added relevant lines in 17 files are covered.
- 7 unchanged lines in 5 files lost coverage.
- Overall coverage decreased (-0.07%) to 75.639%
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
---|---|---|---|
opengrok-indexer/src/main/java/org/opengrok/indexer/util/LimitedReader.java | 18 | 19 | 94.74% |
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/StreamSource.java | 0 | 2 | 0.0% |
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive/GZIPAnalyzer.java | 2 | 4 | 50.0% |
opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java | 17 | 19 | 89.47% |
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive/BZip2Analyzer.java | 5 | 8 | 62.5% |
opengrok-indexer/src/main/java/org/opengrok/indexer/index/Indexer.java | 7 | 10 | 70.0% |
opengrok-indexer/src/main/java/org/opengrok/indexer/web/SingleResult.java | 0 | 6 | 0.0% |
opengrok-web/src/main/java/org/opengrok/web/PageConfig.java | 3 | 14 | 21.43% |
opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java | 0 | 12 | 0.0% |
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/AnalyzerGuru.java | 25 | 43 | 58.14% |
<!-- | Total: | 122 | 219 |
Files with Coverage Reduction | New Missed Lines | % |
---|---|---|
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/AnalyzerGuru.java | 1 | 84.21% |
opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java | 1 | 60.19% |
opengrok-indexer/src/main/java/org/opengrok/indexer/index/Indexer.java | 1 | 51.3% |
opengrok-indexer/src/main/java/org/opengrok/indexer/configuration/IndexTimestamp.java | 2 | 42.86% |
opengrok-web/src/main/java/org/opengrok/web/PageConfig.java | 2 | 39.1% |
<!-- | Total: | 7 |
Totals | |
---|---|
Change from base Build 5462: | -0.07% |
Covered Lines: | 41408 |
Relevant Lines: | 54744 |
💛 - Coveralls
this actually looks good, I just quickly skimmed through
maybe only one concern, I think we should WARN instead of FINE print if a file is skipped because of limits ... (unless FINE is printed by default to log file ... but then I'd like to see those as WARN on console too ... )
What happens if index is created with particular limits and then the limits are changed ?
maybe only one concern, I think we should WARN instead of FINE print if a file is skipped because of limits ... (unless FINE is printed by default to log file ... but then I'd like to see those as WARN on console too ... )
No file is skipped. It is still included but under the HugeTextAnalyzer
where the user chooses how much to analyze (as low as 0). It also gets no xref.
What happens if index is created with particular limits and then the limits are changed ?
I tried to describe that above, but to clarify:
You can change hugeTextThresholdBytes
and then re-index, and (as applicable) PLAIN
files will be reclassified as hugetext
DATA
or hugetext
DATA
will be re-classified to PLAIN
.
If you change hugeTextLimitCharacters
and re-index, then nothing will happen because no state is stored related to that limit.
PLAIN
files inside gzip or bzip2 are handled by HugeTextAnalyzer
, but changing hugeTextThresholdBytes
and re-indexing does nothing because the uncompressed size is not known when checkSettings
would need to decide. (Changing hugeTextLimitCharacters
and reindexing also does nothing for gzip or bzip2, for the same reason above.)
sorry, I meant "trimmed" down, not skipped anyways, logs should say it loud that file x and y are now not considered fully I don't expect this output will flood the screen, but it should be a warning to get printed on console
@tarzanek , that's done.
PLAIN
files inside gzip or bzip2 are handled byHugeTextAnalyzer
, but changinghugeTextThresholdBytes
and re-indexing does nothing because the uncompressed size is not known whencheckSettings
would need to decide.
I suppose it would be straight-forward to store a value for uncompressed size in the Document
so that we could compare against hugeTextThresholdBytes
in checkSettings
for gzip/bzip2. Worth it?
PLAIN
files inside gzip or bzip2 are handled byHugeTextAnalyzer
, but changinghugeTextThresholdBytes
and re-indexing does nothing because the uncompressed size is not known whencheckSettings
would need to decide.I suppose it would be straight-forward to store a value for uncompressed size in the
Document
so that we could compare againsthugeTextThresholdBytes
incheckSettings
for gzip/bzip2. Worth it?
Oh but that would mean decompressing entirely. Probably not a good idea.
lgtm, so +1 from my side but since this PR is big, I'd like to get at least one other review to be merged @vladak , @tulinkry ?
Just rebased on master since this needed revision to accommodate the Configuration
API changes of #3127 (but would not have shown up as having any Git conflicts)
I will take a look; also needs rebase.
Just trivial conflicts upon rebase
Just rebasing for trivial conflicts related to R analyzer and then again after parallel detection merged
Rebased for trivial conflict in search.jsp
Rebased for PageConfig.java re-lo, and git automatic-merge took care of it