
Indexing with position gaps instead of storing empty strings is questionable

Open jan-niestadt opened this issue 4 years ago • 2 comments

To reduce index size for sparse annotations, BlackLab detects runs of successive tokens whose value is the empty string and replaces them with a position gap, storing nothing for the positions where the value was empty. However, this breaks searches for the empty string. For example, we can no longer find unlemmatized documents using [lemma=""]{3}.
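
To illustrate the effect, here is a minimal sketch of a positional index in Python. It is not BlackLab's or Lucene's actual code: `build_index`, `find_runs`, and the `use_position_gaps` flag are hypothetical names, and for simplicity it skips every empty value rather than only runs of them.

```python
from collections import defaultdict


def build_index(values, use_position_gaps):
    """Map each term to the list of token positions where it occurs.

    Hypothetical simplification: with use_position_gaps=True, empty
    values are not stored at all, which leaves a gap at that position.
    """
    postings = defaultdict(list)
    for position, value in enumerate(values):
        if use_position_gaps and value == "":
            continue  # leave a gap: nothing is stored for this position
        postings[value].append(position)
    return postings


def find_runs(postings, term, length):
    """Return start positions where `term` fills `length` successive positions."""
    positions = set(postings.get(term, []))
    return [p for p in sorted(positions)
            if all(p + i in positions for i in range(length))]


# A toy lemma annotation in which positions 1-3 and 5 are unlemmatized.
lemmas = ["the", "", "", "", "cat", ""]
without_gaps = build_index(lemmas, use_position_gaps=False)
with_gaps = build_index(lemmas, use_position_gaps=True)

print(find_runs(without_gaps, "", 3))  # [1]
print(find_runs(with_gaps, "", 3))     # []
```

With gaps, the empty string has no postings at all, so a query equivalent to [lemma=""]{3} has nothing to match, while ordinary terms such as "cat" keep their original positions.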

jan-niestadt, Mar 22 '21 09:03

We could make this an option. It should probably default to off as that is the least surprising and most correct behaviour (if the value is empty, index an empty value), but you can enable it if you have a lot of sparse annotations and don't mind the downside.

jan-niestadt, Mar 22 '21 09:03

On reflection, I would be okay with leaving the current default and having a (documented!) configuration option to disable this optimization. Searching for the empty string doesn't seem like a reasonable feature, outside of very particular situations.

jan-niestadt, May 24 '23 13:05