Tool invert argument `--term-count` should be a required argument
The documentation for invert mentions that the term count argument is required. But this is not the case in the actual tool arguments in tools/invert.cpp. Need to confirm which one it should be.
Docs. https://github.com/pisa-engine/pisa/blob/07d63477462e68cfeb18f624278df545519284c4/docs/src/guide/inverting.md?plain=1#L12
tools/invert.
$ ./build/debug/bin/invert --help
Constructs an inverted index from a forward index.
Usage: ../../pisa/build/debug/bin/invert [OPTIONS]
Options:
-h,--help Print this help message and exit
-i,--input TEXT REQUIRED Forward index basename
-o,--output TEXT REQUIRED Output inverted index basename
--term-count UINT Number of distinct terms in the forward index
-j,--threads UINT Number of threads
--batch-size UINT [100000] Number of documents to process at a time
-L,--log-level TEXT:{critical,debug,err,info,off,trace,warn} [info]
Log level
--config Configuration .ini file
I think it's the documentation that is not accurate. If the term count is not defined, it should read it from the term lexicon: https://github.com/pisa-engine/pisa/blob/main/src/invert.cpp#L241
I think this should probably be reflected in the docs (including the CLI help) rather than term count being required.
Ideally, we wouldn't have the argument at all, as the lexicon should be built when the forward index is created. But I'm not 100% sure there's a way for it not to be there (except for the obvious, manual file deletion).
This goes back to when we didn't have the lexicon at this time, but only new-line separated term list, which would be slow to read and count terms, so the parameter was introduced. Though honestly, idk how else you would know the count... even the documentation says to run wc -l 🤷
I think another consideration is that we never really had a real naming standard or a metadata file that tells you when lexicon is, it could technically be somewhere else, and currently the code guesses using convention.
The immediate fix, imo, should be to fix the documentation.
Great, thank for the detailed response and background regarding why it might be the way that it is. Will take a look at fixing up the documentation for this.