BlackLab icon indicating copy to clipboard operation
BlackLab copied to clipboard

Capture groups in sequences can fail to return everything expected

Open chrisc36 opened this issue 9 years ago • 4 comments

Queries where: 1: The capture group(s) can match a variable number of tokens 2: The non-capture-group part of the query also contains terms that can also match a variable number of tokens

can sometimes fail to return all the capture groups expected. This is caused by the fact sometimes capture groups can capture different Spans on an identical Hit, so the uniqueness filter in SpanQuerySequence.java will remove those results. Therefore if a query matches the same span of tokens as a previous Hit, but could capture different tokens within that span, that capture is dropped.

For example a query like "1:[pos="NOUN"]+ [pos="NOUN"]* d" on the sentence "a b c d" where "a-d" are nouns will capture "a b" and "b", but not "a".

chrisc36 avatar Jun 08 '15 17:06 chrisc36

Yes, I see. I guess the solution would be to make the uniqueness filter optional?

jan-niestadt avatar Jun 09 '15 07:06 jan-niestadt

Yes, but the changes would also have to percolate up to allow Hits.java to either handle non-unique Hit objects or store multiple captures for each Hit. There also might be issue in SpansBucketAbstract since it implicitly assumes each Hit only has only one capture group so this issue might be a bit tricky to solve.

chrisc36 avatar Jun 09 '15 17:06 chrisc36

I've looked into this a bit more and I see what you mean, and why it would be tricky to get this right. Maybe the "proper" solution is to subclass Hit to store captures and take them into account in equals(), compareTo() and hashCode(). Of course, that's a significant amount of work; I'm not sure if and when we'll get around to that.

jan-niestadt avatar Jun 23 '15 13:06 jan-niestadt

Reopening as this was closed in error. This is still an issue; we should think about solutions, or at the very least document it so people understand the issue and can work around it.

The essence of the problem is that if there's multiple ways to capture a group for the same hit, the uniqueness filter doesn't see these matches as unique and will remove all but one of them, leaving you with only one of the possible capture options.

This query illustrates the problem:

A:([] []?) [] []? "tree"

If the complete match is the phrase "big old oak tree", capture group A will either get the value "big" or "big old", but not both, because duplicate matches are removed (and the uniqueness filter doesn't look at capture groups).

We should think about either allowing duplicate matches if their capture groups differ, or allowing multiple sets of captures per match.

jan-niestadt avatar Aug 04 '22 07:08 jan-niestadt

Essentially the same problem but with a different type of query:

<s/> containing A:("cow"|"hare")

or

<s/> containing (A:"cow"|B:"hare")

This query will return hits capturing either cow or hare. But a sentence containing both will not capture both, because there can only be one capture at a time.

UPDATE: maybe this is correct behaviour? "containing" is a filter that decides which sentences to return. Wouldn't it be strange to return a sentence twice because it matched the filter twice? That doesn't seem like how the operator is intended. E.g. <s/> containing "the" shouldn't produce 3 hits for sentences where "the" occurs it three times, should it? If you want to capture both "cow" and "hare", you should probably use the within operator.

Improved uniqueness filter

Probably the way to deal with this is to allow multiple captures. This means that the uniqueness filter would recognize "identical" hits with different capture groups, discard all but one of those "identical" hits, but in the process, merge the capture groups options so we remember all possibilities.

How to extend the API response

To have the API return multiple capture group possibilities, one options would be to change the API so captures are always returned as a list of alternatives (usually only containing one possibility), but this would break compatibility to support a pretty uncommon case.

Maybe a better option is to add an alternateCaptureGroups key in addition to the captureGroups key. We would only set this key for hits with multiple possible captures. This maintains compatibility and doesn't complicate the response structure except for this relatively rare situation.

jan-niestadt avatar Feb 21 '23 13:02 jan-niestadt

We're now integrating capture groups (and other match info) into regular hits, which will allow us to be stricter about uniqueness. This should solve this issue. See branch experiment/unify-captures-relations (or feature/relations).

jan-niestadt avatar Apr 17 '23 11:04 jan-niestadt

The current solution will produce multiple hits with the same doc, start and end, but with different capture information (now called match info internally, because it is used for more than just group captures).

If we want to, we could still add an operation that merges hits with identical doc, start and end and collects all the possible match info.

jan-niestadt avatar May 17 '23 09:05 jan-niestadt

Above solution was merged into dev; closing this issue (finally!)

jan-niestadt avatar Sep 14 '23 12:09 jan-niestadt