
Extra closing token can (in theory) be matched

Open jan-niestadt opened this issue 4 years ago • 4 comments

If you search for [lemma=""], it is possible to match the extra closing token at the end of a document (that exists to store the last bit of punctuation and any final closing tags):

http://svatec10.ivdnt.loc/corpus-frontend/chn-extern/search/hits?first=2796280&number=20&patt=%5Blemma%3D%22%22%5D&interface=%7B%22form%22%3A%22search%22%2C%22patternMode%22%3A%22expert%22%7D

jan-niestadt, Sep 17 '21 09:09

This is not a big problem in practice, because [lemma=""] is not a reasonable query, and a search for [], for example, does not match the extra closing token. Still, it would be better to make sure it can't happen even with unusual queries.

This might be fixed by adding empty values at the extra closing token position only for punct and starttag, and not for any other annotation. But we need to make sure this doesn't cause problems elsewhere. It shouldn't, because document length is determined from a stored field, not from the number of indexed positions an annotation has, but this should be verified.

The relevant code is in DocIndexerBase. We should essentially add one fewer empty string there.
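
To make the idea concrete, here is a minimal, self-contained sketch of the proposed behaviour. It is not the actual DocIndexerBase code: the class `ClosingTokenSketch`, the method `addClosingToken`, and the map-of-lists representation are all hypothetical and only model "one list of values per annotation".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model, NOT BlackLab's real indexing code.
public class ClosingTokenSketch {

    // Assumption: only these annotations need a value at the extra closing position.
    static final Set<String> NEEDS_CLOSING_VALUE = Set.of("punct", "starttag");

    /**
     * Appends the extra closing position at the end of a document.
     * punct receives the trailing punctuation, starttag receives an empty
     * placeholder, and every other annotation receives nothing, so a query
     * such as [lemma=""] has no empty value to match at that position.
     */
    static void addClosingToken(Map<String, List<String>> valuesPerAnnotation, String trailingPunct) {
        for (Map.Entry<String, List<String>> entry : valuesPerAnnotation.entrySet()) {
            String annotation = entry.getKey();
            if (annotation.equals("punct")) {
                entry.getValue().add(trailingPunct);
            } else if (NEEDS_CLOSING_VALUE.contains(annotation)) {
                entry.getValue().add("");
            }
            // All other annotations: intentionally add no value.
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("word", new ArrayList<>(List.of("Hello", "world")));
        doc.put("lemma", new ArrayList<>(List.of("hello", "world")));
        doc.put("punct", new ArrayList<>(List.of("", " ")));
        doc.put("starttag", new ArrayList<>(List.of("", "")));

        addClosingToken(doc, ".");
        doc.forEach((name, values) -> System.out.println(name + ": " + values.size() + " values"));
    }
}
```

In this model punct and starttag end up one position longer than word and lemma, which is exactly why the document length must come from the stored field rather than from the number of indexed positions.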

jan-niestadt, Sep 17 '21 11:09

> It shouldn't, because document length is determined from a stored field, not from the number of indexed positions an annotation has, but this should be verified.

I know of one place where this does happen: https://github.com/INL/BlackLab/blob/dev/engine/src/main/java/nl/inl/blacklab/search/results/HitGroupsTokenFrequencies.java#L225

KCMertens, Sep 21 '21 08:09

I've created #217 to always determine the doc length the same way. Does that look right to you?

(had a false start BTW so recreated the correct pull request)

jan-niestadt, Sep 21 '21 12:09

I think it's probably best to always keep including the extra closing token, so all annotations have the same doc length. If the punct and tag annotations are always one longer than all the others, this quickly gets confusing.

Maybe we could index some magic (almost) unmatchable value instead of an empty string at this position. We'd have to make sure this doesn't cause any new problems though.
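
A rough illustration of what "(almost) unmatchable" could mean, using plain `java.util.regex` rather than the regex engine BlackLab actually queries with (which may behave differently). The sentinel value, `CLOSING_SENTINEL`, and the class name are assumptions made up for this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the sentinel value is an arbitrary choice, not something BlackLab defines.
public class SentinelValueSketch {

    // Assumption: a single character that should never occur in real annotation values.
    static final String CLOSING_SENTINEL = "\uFFFF";

    /** True if the whole indexed value matches the query regex. */
    static boolean matches(String queryRegex, String indexedValue) {
        return Pattern.matches(queryRegex, indexedValue);
    }

    public static void main(String[] args) {
        // The empty-string query matches an empty indexed value (the current problem)...
        System.out.println(matches("", ""));
        // ...but not the sentinel.
        System.out.println(matches("", CLOSING_SENTINEL));
        // A match-anything pattern still matches the sentinel, hence "almost" unmatchable.
        System.out.println(matches(".*", CLOSING_SENTINEL));
    }
}
```

So a sentinel would take care of [lemma=""], but wildcard-style patterns would need separate handling or would still hit the closing position.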

jan-niestadt, Feb 18 '22 14:02