pisa icon indicating copy to clipboard operation
pisa copied to clipboard

Tool invert argument `--term-count` should be a required argument

Open lgrz opened this issue 5 months ago • 2 comments

The documentation for invert mentions that the term count argument is required. But this is not the case in the actual tool arguments in tools/invert.cpp. Need to confirm which one it should be.

Docs. https://github.com/pisa-engine/pisa/blob/07d63477462e68cfeb18f624278df545519284c4/docs/src/guide/inverting.md?plain=1#L12

tools/invert.

$ ./build/debug/bin/invert --help
Constructs an inverted index from a forward index.
Usage: ../../pisa/build/debug/bin/invert [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--input TEXT REQUIRED    Forward index basename
  -o,--output TEXT REQUIRED   Output inverted index basename
  --term-count UINT           Number of distinct terms in the forward index
  -j,--threads UINT           Number of threads
  --batch-size UINT [100000]  Number of documents to process at a time
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn} [info] 
                              Log level
  --config                    Configuration .ini file

lgrz avatar Aug 02 '25 11:08 lgrz

I think it's the documentation that is not accurate. If the term count is not defined, it should read it from the term lexicon: https://github.com/pisa-engine/pisa/blob/main/src/invert.cpp#L241

I think this should probably be reflected in the docs (including the CLI help) rather than term count being required.

Ideally, we wouldn't have the argument at all, as the lexicon should be built when the forward index is created. But I'm not 100% sure there's a way for it not to be there (except for the obvious, manual file deletion).

This goes back to when we didn't have the lexicon at this time, but only new-line separated term list, which would be slow to read and count terms, so the parameter was introduced. Though honestly, idk how else you would know the count... even the documentation says to run wc -l 🤷

I think another consideration is that we never really had a real naming standard or a metadata file that tells you when lexicon is, it could technically be somewhere else, and currently the code guesses using convention.

The immediate fix, imo, should be to fix the documentation.

elshize avatar Aug 09 '25 21:08 elshize

Great, thank for the detailed response and background regarding why it might be the way that it is. Will take a look at fixing up the documentation for this.

lgrz avatar Aug 18 '25 11:08 lgrz