Extra closing token can (in theory) be matched
If you search for [lemma=""], it is possible to match the extra closing token at the end of a document (that exists to store the last bit of punctuation and any final closing tags):
http://svatec10.ivdnt.loc/corpus-frontend/chn-extern/search/hits?first=2796280&number=20&patt=%5Blemma%3D%22%22%5D&interface=%7B%22form%22%3A%22search%22%2C%22patternMode%22%3A%22expert%22%7D
This is not a big problem in practice, because [lemma=""] is not a reasonable query, and a search for [], for example, does not match the extra closing token. Still, it would be better to make sure it can't happen even with unusual queries.
This might be fixed by not adding empty values for annotations except punct and starttag at the extra closing token position. But we need to make sure this doesn't cause problems elsewhere. It shouldn't, because document length is determined from a stored field, not from the number of indexed positions an annotation has, but this should be verified.
The relevant code is in DocIndexerBase. We should essentially add one fewer empty string here.
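To make the proposal concrete, here is a rough sketch of the idea in a toy model. It is not the actual DocIndexerBase code: the class, its methods, and the annotation names `word` and `lemma` are made up for illustration, and the real indexer works quite differently. It only shows that if the extra closing position gets values for punct and starttag alone, a query like [lemma=""] has nothing to match there.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of per-annotation token values; NOT BlackLab's actual classes. */
class ClosingTokenSketch {
    // Annotations that need a value at the extra closing position (per the proposal).
    static final Set<String> CLOSING_ANNOTATIONS = Set.of("punct", "starttag");

    /** Maps annotation name to the values indexed at each token position. */
    final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Adds a regular token with one value per annotation. */
    void addToken(Map<String, String> token) {
        token.forEach((name, value) ->
                values.computeIfAbsent(name, k -> new ArrayList<>()).add(value));
    }

    /** Adds the extra closing token; other annotations get no value at this position. */
    void addClosingToken(String trailingPunct) {
        for (String name : CLOSING_ANNOTATIONS) {
            String value = name.equals("punct") ? trailingPunct : "";
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /** Positions where the annotation has exactly this value (like [annotation="value"]). */
    List<Integer> positionsMatching(String annotation, String value) {
        List<Integer> hits = new ArrayList<>();
        List<String> vals = values.getOrDefault(annotation, List.of());
        for (int i = 0; i < vals.size(); i++) {
            if (vals.get(i).equals(value)) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

In this model, punct ends up one position longer than lemma, which is exactly the side effect that would need checking elsewhere.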
> It shouldn't, because document length is determined from a stored field, not from the number of indexed positions an annotation has, but this should be verified.
I know of one place where this does happen: https://github.com/INL/BlackLab/blob/dev/engine/src/main/java/nl/inl/blacklab/search/results/HitGroupsTokenFrequencies.java#L225
I've created #217 to always determine the doc length the same way. Does that look right to you?
(had a false start BTW so recreated the correct pull request)
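For anyone following along, the general idea of "always determine the doc length the same way" could look something like the sketch below. This is not the code from the pull request. The field name `length_tokens` is hypothetical, and the sketch assumes the stored length counts the extra closing token, which would need to be checked against the real index format.

```java
import java.util.Map;

/** Toy illustration of a single doc-length helper; NOT the code from the pull request. */
class DocLengthSketch {
    // Hypothetical name of the stored field holding the document length.
    static final String LENGTH_FIELD = "length_tokens";

    // Assumption: the stored length includes the extra closing token.
    static final int EXTRA_CLOSING_TOKENS = 1;

    /** Length in real tokens, always read from the stored field, never from indexed positions. */
    static int docLengthTokens(Map<String, String> storedFields) {
        String stored = storedFields.get(LENGTH_FIELD);
        if (stored == null) {
            throw new IllegalArgumentException("Document has no stored length field");
        }
        return Integer.parseInt(stored) - EXTRA_CLOSING_TOKENS;
    }
}
```

If every caller goes through one helper like this, the result no longer depends on how many positions a particular annotation happens to have.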
I think it's probably best to always keep including the extra closing token, so all annotations have the same doc length. If the punct and tag annotations are always one longer than all the others, this quickly gets confusing.
Maybe we could index some magic (almost) unmatchable value instead of an empty string at this position. We'd have to make sure this doesn't cause any new problems though.
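A quick sketch of why such a value would only be "almost" unmatchable. The choice of U+FFFF (a Unicode noncharacter) is just an example and not something BlackLab uses as far as I know; the class and method names are hypothetical, and plain string and regex matching stand in for the real query engine.

```java
import java.util.regex.Pattern;

/** Toy illustration of a placeholder value for the closing token; names are hypothetical. */
class SentinelSketch {
    /** Example placeholder: the Unicode noncharacter U+FFFF, unlikely to occur in real text. */
    static final String CLOSING_TOKEN_VALUE = "\uFFFF";

    /** Stand-in for an exact-value query such as [lemma=""]. */
    static boolean matchesExact(String indexedValue, String queryValue) {
        return indexedValue.equals(queryValue);
    }

    /** Stand-in for a regex query; the regex must match the whole value. */
    static boolean matchesRegex(String indexedValue, String regex) {
        return Pattern.matches(regex, indexedValue);
    }
}
```

With this placeholder, an exact match on the empty string no longer hits the closing token and neither does a regex like `[a-z]+`, but a catch-all regex such as `.*` still does. So catch-all patterns are one of the cases that would need checking.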