Non-greedy quantifiers ['*?', '+?']
This PR adds the non-greedy quantifiers *? and +? to the spaCy Matcher.
Description
Abstract
This pull request implements the non-greedy quantifiers *? and +?, which match as few tokens as possible. In the current spaCy Matcher, the * and + quantifiers return all possible matches, which may be more than the user actually wants.
As part of this pull request, the quantifier ZERO_MINUS is added and the existing quantifier ONE is modified, so that *? maps to (ZERO_MINUS,) and +? maps to (ONE, ZERO_MINUS).
The non-greedy behaviour is implemented mainly via cast_to_non_greedy_action(), which converts the actions returned by get_action() into their non-greedy equivalents, i.e. it discards the matches that only the greedy behaviour would produce.
Additionally, this PR could serve as the basis for a non-greedy version of the min/max quantifiers, {m,n}?.
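To make the mapping concrete, here is a rough sketch in plain Python of how the operator strings could correspond to quantifier sequences. The *? and +? entries follow the description above; the Quantifier enum and the existing greedy entries are only illustrative stand-ins for the definitions in matcher.pyx.

from enum import Enum, auto

class Quantifier(Enum):
    # Illustrative stand-ins for the quantifiers defined in matcher.pyx
    ZERO = auto()
    ZERO_ONE = auto()
    ZERO_PLUS = auto()
    ONE = auto()
    ZERO_MINUS = auto()  # new quantifier added by this PR

# Sketch of the operator-to-quantifier mapping
OPERATORS = {
    "!": (Quantifier.ZERO,),
    "?": (Quantifier.ZERO_ONE,),
    "*": (Quantifier.ZERO_PLUS,),
    "+": (Quantifier.ONE, Quantifier.ZERO_PLUS),
    # Non-greedy variants added by this PR:
    "*?": (Quantifier.ZERO_MINUS,),
    "+?": (Quantifier.ONE, Quantifier.ZERO_MINUS),
}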
Rationale
Accuracy
The current spaCy Matcher shows "get all" behaviour for * and +: all possible matches are returned. This can produce unwanted results and force the user to filter the matches themselves. The example below uses the proposed +? quantifier to extract each ingredient line:
pattern = [{"IS_DIGIT": True}, {"ORTH": "."}, {"OP": "+?"}, {"ORTH": "\n"}]
matcher.add("Pattern", [pattern])
doc = nlp("""
These are the ingredients to make a soufflé:
1. 200g of flour
2. 2 eggs
3. 4 pods of vanilla
4. water
""")
Output:
"1. 200g of flour ", "2. 2 eggs", "3. 4 pods of vanilla", "4. water"
In the above example, the desired output is each individual line of ingredients. If the greedy + quantifier were used instead, it would also return longer results such as "1. 200g of flour 2. 2 eggs" and even the whole string.
Minimal or non-greedy quantifiers, which match as few tokens as possible, produce the desired output of exactly the four lines of ingredients, as seen above.
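Continuing from the example above but swapping "+?" for the greedy "+", a user currently has to post-filter the matches themselves, for instance by keeping only the shortest match for each end position. The sketch below illustrates such a workaround; the variable names and filtering logic are illustrative, not part of spaCy, and nlp and doc are as defined above.

# Same pattern as above, but with the existing greedy "+"
greedy_matcher = Matcher(nlp.vocab)
greedy_pattern = [{"IS_DIGIT": True}, {"ORTH": "."}, {"OP": "+"}, {"ORTH": "\n"}]
greedy_matcher.add("GreedyPattern", [greedy_pattern])

# Keep only the shortest match (largest start) for each end position,
# which is roughly what "+?" returns directly.
shortest = {}
for match_id, start, end in greedy_matcher(doc):
    if end not in shortest or start > shortest[end]:
        shortest[end] = start
print([doc[start:end].text for end, start in sorted(shortest.items())])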
Speed Improvements
Furthermore, because branches are terminated as soon as a match is found, non-greedy quantifiers also run faster.
Using the sample text and the code below, the greedy pattern, which combines function (syncategorematic) words with greedy wildcards, is consistently slower than its non-greedy counterpart.
Greedy pattern
{"TEXT": "a"}, {"OP": "+"}, {"TEXT": "the"}, {"OP": "+"}, {"TEXT": "of"}, {"OP": "+"}, {"TEXT": "to"}, {"OP": "+"}, {"TEXT": "be"}
Non-greedy pattern
{"TEXT": "a"}, {"OP": "+?"}, {"TEXT": "the"}, {"OP": "+?"}, {"TEXT": "of"}, {"OP": "+?"}, {"TEXT": "to"}, {"OP": "+?"}, {"TEXT": "be"}
One run of the test on the 670-word sample text:
Greedy + : 0.47761 seconds
Non-greedy +? : 0.39530 seconds
Speed test
import timeit
import_module = '''
import spacy
from spacy.matcher import Matcher
'''
testcode1 = '''
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
file_name = 'text.txt'
text = open(file_name).read()
pattern = [{"TEXT": "a"}, {"OP": "+"}, {"TEXT": "the"}, {"OP": "+"}, {"TEXT": "of"}, {"OP": "+"}, {"TEXT": "to"}, {"OP": "+"}, {"TEXT": "be"}]
matcher.add("Pattern", [pattern])
doc = nlp(text)
matches = matcher(doc)
'''
testcode2 = '''
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
file_name = 'text.txt'
text = open(file_name).read()
pattern = [{"TEXT": "a"}, {"OP": "+?"}, {"TEXT": "the"}, {"OP": "+?"}, {"TEXT": "of"}, {"OP": "+?"}, {"TEXT": "to"}, {"OP": "+?"}, {"TEXT": "be"}]
matcher.add("Pattern", [pattern])
doc = nlp(text)
matches = matcher(doc)
'''
print("Greedy + :", timeit.timeit(stmt=testcode1, setup=import_module, number=1))
print("Non-greedy +? : ", timeit.timeit(stmt=testcode2, setup=import_module, number=1))
The non-greedy quantifiers would thus simplify working with the Matcher and would be a beneficial addition to spaCy.
*? vs * Illustration
(Refer to spacy/tests/matcher/test_matcher_logic.py for more examples)
*?
string: "a a b"
pattern: a*? b*
result = ["a a b", "a b", "b"]
vs
*
string: "a a b"
pattern: a* b*
result = ["a", "a a", "a", "a a b", "a b", "b"]
Main changes
matcher.pyx and matcher.pxd
- Added Quantifier "ZERO_MINUS", modified "ONE"
- Added actions "EXTEND", "RETRY_OR_EXTEND" and "MATCH_ADVANCE"
- Added cast_to_non_greedy_action()
test_matcher_logic.py
- Added test_matcher_non_greedy_operator() to demonstrate the non-greedy behaviour of "*?" and "+?"
Types of change
New feature
Checklist
- [x] I confirm that I have the right to submit this contribution under the project's MIT license.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
Thanks for the PR! Since this changes the Matcher algorithm, we may need some time to do a careful evaluation, especially to make sure it doesn't affect the current performance. To make it easier to review, could you undo all the unrelated formatting changes in matcher.pyx?
Removed, thanks!
Thanks for the quick reply, we'll have a look!
In my first round of simple comparisons, the speed of the matcher for patterns involving * and + looks to be unchanged.
Let me make a few minor formatting edits to make the diff a bit cleaner so it's easier to review all the changes to the algorithm.
Hello! Are there any updates for this? Thank you :)
Apologies for the huge delay here! We're definitely still considering this for v3.5 and I have been working on profiling the performance for existing Matcher workloads.
Hi @shen-qin, I want to thank you again for all the work & careful explanations you've put into this proposal.
Unfortunately, we've decided that we won't be merging this functionality into spaCy's core code base right now. We've had some internal discussion around this, but found the overall behaviour of these new quantifiers, especially when combined with spaCy's existing greedy quantifiers, simply too confusing & unexpected. The fact that they behave in an asymmetric way, and that you really can only understand the behaviour once you start thinking about all the details of the algorithmic implementation, makes it very difficult to come up with easy-to-explain use cases for documentation purposes. We think it'll be too confusing to be practically useful and to warrant the maintenance costs that would be associated with having this feature.
I'm sorry that this is the conclusion we've reached, and I want to repeat that we really do appreciate your efforts on this, so we hope this won't discourage you from contributing in the future!
Hi @svlandeg and team, thank you so much for the explanation. Really sad that this did not go through but I do understand. Once again thank you!