Add a setting for how to deal with overlapping matches

Open jan-niestadt opened this issue 7 months ago • 0 comments

Certain queries such as [lemma="cat"] [lemma!="dog"]{1,10} can produce a bunch of overlapping hits (cat followed by 1 non-dog; cat followed by 2 non-dogs; etc.). For some queries you want all the possibilities, but for others you would prefer these hits to be filtered down to the ones most relevant to you.
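To make the overlap concrete, here is a small toy sketch (not BlackLab code) that enumerates hits for a cat-followed-by-non-dogs query over a list of lemmas. It assumes the repetition is meant as a range of 1 to 10, as the description of the hits suggests; `find_hits` and the sample sentence are made up for illustration, and hits are (start, end) token positions with an exclusive end.

```python
def find_hits(lemmas, min_rep=1, max_rep=10):
    """Enumerate every (start, end) span matching: "cat", then 1..10 non-"dog" tokens."""
    hits = []
    for start, lemma in enumerate(lemmas):
        if lemma != "cat":
            continue
        end = start + 1
        reps = 0
        # Extend the span one non-dog token at a time; every length is its own hit.
        while end < len(lemmas) and lemmas[end] != "dog" and reps < max_rep:
            end += 1
            reps += 1
            if reps >= min_rep:
                hits.append((start, end))
    return hits


lemmas = ["the", "cat", "sit", "on", "the", "mat", "dog"]
hits = find_hits(lemmas)
print(hits)  # [(1, 3), (1, 4), (1, 5), (1, 6)]
```

One occurrence of "cat" yields four hits here, all sharing the same start position and each contained in the next.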

This is somewhat similar to how regex engines usually have greedy, reluctant and possessive matching modes (see e.g. here), although replicating those exact behaviours in BlackLab would be challenging, because it finds matches in a different way, using the reverse index.
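For comparison, this is what the greedy and reluctant modes look like in an ordinary regex engine, using Python's standard `re` module on a made-up string. Note that a regex engine reports a single match per start position, and the mode only decides which one; possessive matching is left out of this sketch.

```python
import re

text = "cat x y z"

# Greedy: the quantifier takes as many repetitions as it can.
greedy = re.match(r"cat( \w+){1,10}", text).group(0)

# Reluctant (lazy): the quantifier takes as few repetitions as it can.
reluctant = re.match(r"cat( \w+){1,10}?", text).group(0)

print(greedy)     # cat x y z
print(reluctant)  # cat x
```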

There are many ways BlackLab could filter out certain overlapping hits, e.g.:

  • keep everything (this is how it currently works)
  • for hits with the same start position, discard all but the longest (or shortest) (but giving start position a special meaning seems arbitrary)
  • when two hits overlap, keep the one that starts the earliest in the document; discard the other (again, seems arbitrary)
  • discard any hits that are fully contained in another hit (or that fully contain another hit)
  • when two hits (partially or fully) overlap, keep the longest (or shortest); discard the other
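A rough sketch of how the modes above could behave, written as plain Python over (start, end) spans with an exclusive end. This is only an illustration, not a proposal for the actual implementation; all function names are hypothetical. The pairwise "keep one, discard the other" rules are ambiguous when more than two hits overlap, so the sketch assumes a simple greedy pass: visit hits in priority order and keep a hit only if it does not overlap anything already kept.

```python
from itertools import groupby


def overlaps(a, b):
    """True if spans a and b share at least one token position."""
    return a[0] < b[1] and b[0] < a[1]


def contains(outer, inner):
    """True if inner lies fully inside a different span outer."""
    return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]


def longest_per_start(hits, shortest=False):
    """For hits with the same start, keep only the longest (or shortest)."""
    pick = min if shortest else max
    by_start = groupby(sorted(hits), key=lambda h: h[0])
    return [pick(group, key=lambda h: h[1]) for _, group in by_start]


def drop_contained(hits):
    """Discard any hit that is fully contained in another hit."""
    return [h for h in hits if not any(contains(o, h) for o in hits)]


def keep_non_overlapping(hits, priority):
    """Greedy pass (an assumption): best-priority hits win, overlapping ones are dropped."""
    kept = []
    for h in sorted(hits, key=priority):
        if not any(overlaps(h, k) for k in kept):
            kept.append(h)
    return sorted(kept)


def earliest_first(h):
    return (h[0], h[1])


def longest_first(h):
    return (-(h[1] - h[0]), h[0])


hits = [(1, 3), (1, 4), (1, 5), (4, 7), (6, 8)]
print(longest_per_start(hits))                     # [(1, 5), (4, 7), (6, 8)]
print(drop_contained(hits))                        # [(1, 5), (4, 7), (6, 8)]
print(keep_non_overlapping(hits, earliest_first))  # [(1, 3), (4, 7)]
print(keep_non_overlapping(hits, longest_first))   # [(1, 5), (6, 8)]
```

Even on this tiny example the modes disagree about which hits survive, which is a reason to make the behaviour a setting rather than picking one.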

We should try to support some of the most helpful modes.

(via @franklandsbergen)

jan-niestadt avatar Jul 02 '24 13:07 jan-niestadt