Unique terms not available in IndexReaderUtils

Open djoerd opened this issue 2 years ago • 2 comments

I want to know the number of unique terms in my index, but IndexReaderUtils reports -1.

Steps:

    IndexCollection -collection TrecCollection -input /home/hiemstra/Data/robust04/ -index lucene-index.robust04.pos+docvectors -threads 16 -storePositions -storeDocvectors
    IndexReaderUtils -stats -index lucene-index.robust04.pos+docvectors/

Results:

    Index statistics
    ----------------
    documents: 528030
    documents (non-empty): 528030
    unique terms: -1
    total terms: 174540872

It turns out that: "Terms.size(): (...) may be unavailable (returns -1) for some Terms implementations such as MultiTerms where it cannot be efficiently computed."

I have already solved this myself and will open a pull request.

djoerd, Jan 23 '23 10:01

To get an accurate count of the vocab size, you have to use the -optimize flag, which merges all the index segments down into a single one.

lintool, Jan 30 '23 21:01

Thanks a lot (also for doing this on SIGIR deadline day!). That solves my problem. The -optimize flag is costly for a large index though, so the PR may still be helpful. I checked, and the PR gives exactly the same number of unique terms on my (non-optimized) Robust04 index as -optimize does: 923436 unique terms. So the Lucene term iterator seems to work correctly across multiple segments.
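To illustrate why a multi-segment index cannot report its vocabulary size directly while iterating over the terms still gives the right answer, here is a minimal sketch in plain Java. It is a toy model, not Lucene's API and not the code from the PR: each segment is represented as a sorted set of strings, and the class and method names (UniqueTermCount, countUnique, sumOfSegmentSizes) are hypothetical. Summing the per-segment sizes over-counts terms that occur in more than one segment, whereas a k-way merge over the sorted term lists counts each distinct term once.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model only: segments are sorted sets of strings, not Lucene segments.
public class UniqueTermCount {

    /** One cursor per segment: the current term plus the rest of that segment's sorted terms. */
    private static final class Cursor {
        String current;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.current = it.next(); // caller guarantees the segment is non-empty
        }

        /** Moves to the next term; returns false when the segment is exhausted. */
        boolean advance() {
            if (!rest.hasNext()) {
                return false;
            }
            current = rest.next();
            return true;
        }
    }

    /** Counts distinct terms across all segments with a k-way merge over the sorted term lists. */
    public static long countUnique(List<SortedSet<String>> segments) {
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>((a, b) -> a.current.compareTo(b.current));
        for (SortedSet<String> segment : segments) {
            if (!segment.isEmpty()) {
                queue.add(new Cursor(segment.iterator()));
            }
        }
        long unique = 0;
        String previous = null;
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();
            // Terms arrive in sorted order, so duplicates are always adjacent.
            if (!c.current.equals(previous)) {
                unique++;
                previous = c.current;
            }
            if (c.advance()) {
                queue.add(c);
            }
        }
        return unique;
    }

    /** Sum of per-segment sizes; over-counts terms shared between segments. */
    public static long sumOfSegmentSizes(List<SortedSet<String>> segments) {
        long total = 0;
        for (SortedSet<String> segment : segments) {
            total += segment.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<SortedSet<String>> segments = new ArrayList<>();
        segments.add(new TreeSet<>(List.of("index", "lucene", "term")));
        segments.add(new TreeSet<>(List.of("lucene", "segment", "term")));
        System.out.println("sum of segment sizes: " + sumOfSegmentSizes(segments)); // 6
        System.out.println("unique terms: " + countUnique(segments));               // 4
    }
}
```

The merge touches every term in every segment, so the cost grows with the total vocabulary across segments. That is cheaper than rewriting the index into a single segment, but it is still a full pass over the term dictionary rather than a stored statistic.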

BTW, I forgot to remove the "terms" declaration from the original getIndexStats() method (I should have installed Eclipse right away).

djoerd, Jan 31 '23 14:01