Non-greedy quantifiers ['*?', '+?']
This PR adds the non-greedy quantifiers *? and +? to the spaCy Matcher.
Description
Abstract
This pull request implements the non-greedy quantifiers *? and +?, which match as few tokens as possible. In the current spaCy Matcher, the * and + quantifiers return all possible matches, which may be more than the user actually wants.
As part of this pull request, the quantifier ZERO_MINUS is added and the existing quantifier ONE is modified, so that *? maps to (ZERO_MINUS,) and +? maps to (ONE, ZERO_MINUS).
The non-greedy behaviour is implemented mainly via cast_to_non_greedy_action(), which converts the actions returned by get_action() into their non-greedy equivalents, i.e. it discards the matches that only the greedy behaviour would produce.
Additionally, this PR could serve as the basis for a non-greedy version of the min/max quantifiers, {m,n}?.
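To make the mapping concrete, here is a rough sketch in plain Python of how the operator strings could correspond to quantifier sequences. The *? and +? entries follow the description above; the Quantifier enum and the existing greedy entries are only illustrative stand-ins for the definitions in matcher.pyx.

from enum import Enum, auto

class Quantifier(Enum):
    # Illustrative stand-ins for the quantifiers defined in matcher.pyx
    ZERO = auto()
    ZERO_ONE = auto()
    ZERO_PLUS = auto()
    ONE = auto()
    ZERO_MINUS = auto()  # new quantifier added by this PR

# Sketch of the operator-to-quantifier mapping
OPERATORS = {
    "!": (Quantifier.ZERO,),
    "?": (Quantifier.ZERO_ONE,),
    "*": (Quantifier.ZERO_PLUS,),
    "+": (Quantifier.ONE, Quantifier.ZERO_PLUS),
    # Non-greedy variants added by this PR:
    "*?": (Quantifier.ZERO_MINUS,),
    "+?": (Quantifier.ONE, Quantifier.ZERO_MINUS),
}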
Rationale
Accuracy
The current spaCy Matcher shows "get all" behaviour for * and +: all possible matches are returned. This can produce unwanted results and force the user to filter the matches themselves. The example below uses the proposed +? quantifier to extract each ingredient line:
pattern = [{"IS_DIGIT": True}, {"ORTH": "."}, {"OP": "+?"}, {"ORTH": "\n"}]
matcher.add("Pattern", [pattern])
doc = nlp("""
These are the ingredients to make a soufflé:
1. 200g of flour
2. 2 eggs
3. 4 pods of vanilla
4. water
""")
Output:
"1. 200g of flour ", "2. 2 eggs", "3. 4 pods of vanilla", "4. water"
In the above example, the desired output is each individual line of ingredients. If the greedy + quantifier were used instead, it would also return longer results such as "1. 200g of flour 2. 2 eggs" and even the whole string.
Minimal or non-greedy quantifiers, which match as few tokens as possible, produce the desired output of exactly the four lines of ingredients, as seen above.
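Continuing from the example above but swapping "+?" for the greedy "+", a user currently has to post-filter the matches themselves, for instance by keeping only the shortest match for each end position. The sketch below illustrates such a workaround; the variable names and filtering logic are illustrative, not part of spaCy, and nlp and doc are as defined above.

# Same pattern as above, but with the existing greedy "+"
greedy_matcher = Matcher(nlp.vocab)
greedy_pattern = [{"IS_DIGIT": True}, {"ORTH": "."}, {"OP": "+"}, {"ORTH": "\n"}]
greedy_matcher.add("GreedyPattern", [greedy_pattern])

# Keep only the shortest match (largest start) for each end position,
# which is roughly what "+?" returns directly.
shortest = {}
for match_id, start, end in greedy_matcher(doc):
    if end not in shortest or start > shortest[end]:
        shortest[end] = start
print([doc[start:end].text for end, start in sorted(shortest.items())])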
Speed Improvements
Furthermore, because branches are terminated as soon as a match is found, non-greedy quantifiers also run faster.
Using the sample text and the code below, the greedy pattern, which combines function (syncategorematic) words with greedy wildcards, is consistently slower than its non-greedy counterpart.
Greedy pattern
{"TEXT": "a"}, {"OP": "+"}, {"TEXT": "the"}, {"OP": "+"}, {"TEXT": "of"}, {"OP": "+"}, {"TEXT": "to"}, {"OP": "+"}, {"TEXT": "be"}
Non-greedy pattern
{"TEXT": "a"}, {"OP": "+?"}, {"TEXT": "the"}, {"OP": "+?"}, {"TEXT": "of"}, {"OP": "+?"}, {"TEXT": "to"}, {"OP": "+?"}, {"TEXT": "be"}
One run of the test on the 670-word sample text:
Greedy + : 0.47761 seconds
Non-greedy +? : 0.39530 seconds
Speed test
import timeit
import_module = '''
import spacy
from spacy.matcher import Matcher
'''
testcode1 = '''
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
file_name = 'text.txt'
text = open(file_name).read()
pattern = [{"TEXT": "a"}, {"OP": "+"}, {"TEXT": "the"}, {"OP": "+"}, {"TEXT": "of"}, {"OP": "+"}, {"TEXT": "to"}, {"OP": "+"}, {"TEXT": "be"}]
matcher.add("Pattern", [pattern])
doc = nlp(text)
matches = matcher(doc)
'''
testcode2 = '''
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
file_name = 'text.txt'
text = open(file_name).read()
pattern = [{"TEXT": "a"}, {"OP": "+?"}, {"TEXT": "the"}, {"OP": "+?"}, {"TEXT": "of"}, {"OP": "+?"}, {"TEXT": "to"}, {"OP": "+?"}, {"TEXT": "be"}]
matcher.add("Pattern", [pattern])
doc = nlp(text)
matches = matcher(doc)
'''
print("Greedy + :", timeit.timeit(stmt=testcode1, setup=import_module, number=1))
print("Non-greedy +? : ", timeit.timeit(stmt=testcode2, setup=import_module, number=1))
The non-greedy quantifiers would thus simplify working with the Matcher and would be a beneficial addition to spaCy.
*? vs * Illustration
(Refer to spacy/tests/matcher/test_matcher_logic.py for more examples)
*?
string: "a a b"
pattern: a*? b*
result = ["a a b", "a b", "b"]
vs
*
string: "a a b"
pattern: a* b*
result = ["a", "a a", "a", "a a b", "a b", "b"]
Main changes
matcher.pyx and matcher.pxd
- Added Quantifier "ZERO_MINUS", modified "ONE"
- Added actions "EXTEND", "RETRY_OR_EXTEND" and "MATCH_ADVANCE"
- Added cast_to_non_greedy_action()
test_matcher_logic.py
- Added test_matcher_non_greedy_operator() to demonstrate the non-greedy behaviour of "*?" and "+?"
Types of change
New feature
Checklist
- [x] I confirm that I have the right to submit this contribution under the project's MIT license.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
Thanks for the PR! Since this changes the Matcher algorithm, we may need some time to do a careful evaluation, especially to make sure it doesn't affect the current performance. To make it easier to review, could you undo all the unrelated formatting changes in matcher.pyx?
Removed, thanks!
Thanks for the quick reply, we'll have a look!
In my first round of simple comparisons, the speed of the matcher for patterns involving * and + looks to be unchanged.
Let me make a few minor formatting edits to make the diff a bit cleaner so it's easier to review all the changes to the algorithm.
Hello! Are there any updates for this? Thank you :)
Apologies for the huge delay here! We're definitely still considering this for v3.5 and I have been working on profiling the performance for existing Matcher workloads.
Hi @shen-qin, I want to thank you again for all the work & careful explanations you've put into this proposal.
Unfortunately, we've decided that we won't be merging this functionality into spaCy's core code base right now. We've had some internal discussion around this, but found the overall behaviour of these new quantifiers, especially when combined with spaCy's existing greedy quantifiers, simply too confusing & unexpected. The fact that they behave in an asymmetric way, and that you really can only understand the behaviour once you start thinking about all the details of the algorithmic implementation, makes it very difficult to come up with easy-to-explain use cases for documentation purposes. We think it'll be too confusing to be practically useful and to warrant the maintenance costs that would be associated with having this feature.
I'm sorry that this is the conclusion we've reached, and I want to repeat that we really do appreciate your efforts on this, so we hope this won't discourage you from contributing in the future!
Hi @svlandeg and team, thank you so much for the explanation. Really sad that this did not go through but I do understand. Once again thank you!