BlackLab
Indexing with position gaps instead of storing empty strings is questionable
To reduce index size for sparse annotations, BlackLab detects runs of successive tokens whose value is the empty string and replaces them with a position gap, storing nothing for the positions where the value was empty. The downside is that searches for the empty string no longer match those positions. For example, we can no longer find unlemmatized documents using [lemma=""]{3}.
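To make the trade-off concrete, here is a minimal toy sketch of a positional index. It is not BlackLab's or Lucene's actual code; `build_index`, `find_runs` and the `skip_empty` flag are hypothetical names used only for illustration. It shows why a query for three consecutive empty lemmas stops matching once empty values are skipped.

```python
from collections import defaultdict


def build_index(tokens, skip_empty):
    """Map each value to the list of token positions where it occurs.

    With skip_empty=True, empty values are not indexed at all, which leaves
    a gap in the positions (a toy stand-in for the optimization discussed here).
    """
    postings = defaultdict(list)
    for position, value in enumerate(tokens):
        if skip_empty and value == "":
            continue  # nothing stored for this position: a "position gap"
        postings[value].append(position)
    return dict(postings)


def find_runs(postings, value, length):
    """Return start positions where `value` occurs at `length` consecutive positions."""
    positions = set(postings.get(value, []))
    return [p for p in sorted(positions)
            if all(p + i in positions for i in range(length))]


# Hypothetical lemma annotation: three unlemmatized tokens in the middle.
lemmas = ["the", "", "", "", "cat"]

full = build_index(lemmas, skip_empty=False)
sparse = build_index(lemmas, skip_empty=True)

print(find_runs(full, "", 3))    # [1]  (analogous to [lemma=""]{3} matching)
print(find_runs(sparse, "", 3))  # []   (the empty value was never indexed)
print(sparse)                    # {'the': [0], 'cat': [4]}
```

Non-empty values keep their original positions in both variants; only lookups of the empty string itself are affected.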
We could make this an option. It should probably default to off as that is the least surprising and most correct behaviour (if the value is empty, index an empty value), but you can enable it if you have a lot of sparse annotations and don't mind the downside.
On reflection, I would be okay with leaving the current default and having a (documented!) configuration option to disable this optimization. Searching for the empty string doesn't seem like a reasonable feature, outside of very particular situations.