spaCy
Feature Request: Pass custom values from Matcher pattern definitions to matched tokens
Discussed in https://github.com/explosion/spaCy/discussions/13519
Originally posted by apodgorny June 5, 2024
Consider a case where I need to tag FAX and TEL separately.
Tel: 24234-3433-3322 Fax: 24234-3433-3323
I currently have two options for NER with Matcher:
- Match
[{'LOWER': 'tel'}, {'ORTH': ':'}, {PATTERN_TO_MATCH_PHONE}]
- Match
[{PATTERN_TO_MATCH_PHONE}]
Neither option accomplishes the goal:
- The first includes unnecessary extra tokens in the match (tokens that may be needed for additional, unrelated tagging; I have a case to show as well)
- The second does not distinguish between FAX and TEL
SOLUTION:
Token.set_extension('exclude', default=False, force=True)
patterns = [
{'LOWER': 'tel', '_': {'exclude': True}},
{'ORTH': ':', '_': {'exclude': True}},
{PATTERN_TO_MATCH_PHONE}
]
These custom values should be passed to the tokens matched by the call matches = matcher(doc), so that tokens can be distinguished by the pattern that matched them, like so: doc[n]._.exclude == True
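For comparison, here is a rough sketch of how the same flag could be set with the existing `on_match` callback. As far as I can tell, the `'_'` key in current Matcher patterns tests existing extension values rather than assigning them, so the callback does the assignment instead. The `PHONE` regex is only a stand-in for PATTERN_TO_MATCH_PHONE, and `PREFIX_LEN` and `mark_prefix` are hypothetical names made up for this illustration. The Doc is built from pre-split words so the result does not depend on tokenizer rules.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Token

# Same extension as in the proposal above.
Token.set_extension("exclude", default=False, force=True)

nlp = spacy.blank("en")
# Pre-tokenized input, so tokenizer behaviour does not matter here.
doc = Doc(nlp.vocab, words=["Tel", ":", "24234-3433-3322"])

# Assumption: a simple stand-in for PATTERN_TO_MATCH_PHONE.
PHONE = {"TEXT": {"REGEX": r"^\d+(-\d+)+$"}}
# Hypothetical constant: number of leading context tokens in the pattern.
PREFIX_LEN = 2


def mark_prefix(matcher, doc, i, matches):
    """Flag the leading context tokens of match i as excluded."""
    match_id, start, end = matches[i]
    for token in doc[start:start + PREFIX_LEN]:
        token._.exclude = True


matcher = Matcher(nlp.vocab)
matcher.add(
    "TEL",
    [[{"LOWER": "tel"}, {"ORTH": ":"}, PHONE]],
    on_match=mark_prefix,
)

matches = matcher(doc)
flags = [token._.exclude for token in doc]
```

This works, but the number of context tokens has to be kept in sync with the pattern by hand, which is exactly the bookkeeping the proposed per-token values would remove.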
This would cover multiple cases that were previously hard or impossible to solve with the spaCy Matcher:
- Matching by preceding tokens
- Matching by following tokens
- Matching a complex pattern of tokens that appear together in a constellation, in order to tag them separately.
- Cascading match, where you tag items and then match again relying on previously tagged entities, without overwriting them
- Other potential cases that I did not think of, but that others could come up with, which would benefit from being able to pass data this way.
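To make the first case in the list concrete (matching by preceding tokens), this is roughly what the manual version looks like today: one rule per label, then trimming the context tokens off each match before tagging. It is a sketch under assumptions: the `PHONE` regex stands in for PATTERN_TO_MATCH_PHONE, and `PREFIX_LEN` is a hypothetical constant for the number of context tokens.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")
# Pre-tokenized input, so tokenizer behaviour does not matter here.
words = ["Tel", ":", "24234-3433-3322", "Fax", ":", "24234-3433-3323"]
doc = Doc(nlp.vocab, words=words)

# Assumption: a simple stand-in for PATTERN_TO_MATCH_PHONE.
PHONE = {"TEXT": {"REGEX": r"^\d+(-\d+)+$"}}
# Hypothetical constant: both patterns start with two context tokens.
PREFIX_LEN = 2

matcher = Matcher(nlp.vocab)
matcher.add("TEL", [[{"LOWER": "tel"}, {"ORTH": ":"}, PHONE]])
matcher.add("FAX", [[{"LOWER": "fax"}, {"ORTH": ":"}, PHONE]])

entities = []
for match_id, start, end in matcher(doc):
    # The rule name tells TEL and FAX apart.
    label = nlp.vocab.strings[match_id]
    # Drop the leading "tel" / "fax" and ":" tokens from the tagged span.
    entities.append(Span(doc, start + PREFIX_LEN, end, label=label))

doc.ents = entities
result = [(ent.label_, ent.text) for ent in doc.ents]
```

The trimming step is the part that per-token values in the pattern would make unnecessary, and it gets harder to maintain once patterns contain optional or repeated tokens.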
Thank you for an awesome library; this addition would make it awesome-awesome :)
P.S. Extra credit :)
If we could do matches[n].tokens, it would be triple awesome
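For reference, the closest thing I am aware of in spaCy v3 is calling the matcher with `as_spans=True`, which returns `Span` objects instead of `(match_id, start, end)` tuples. A minimal sketch, again with an assumed regex standing in for PATTERN_TO_MATCH_PHONE:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Pre-tokenized input, so tokenizer behaviour does not matter here.
doc = Doc(nlp.vocab, words=["Fax", ":", "24234-3433-3323"])

matcher = Matcher(nlp.vocab)
matcher.add(
    "FAX",
    # Assumption: the regex is a stand-in for PATTERN_TO_MATCH_PHONE.
    [[{"LOWER": "fax"}, {"ORTH": ":"}, {"TEXT": {"REGEX": r"^\d+(-\d+)+$"}}]],
)

# Each match comes back as a Span labelled with the rule name.
spans = matcher(doc, as_spans=True)
first = spans[0]
matched_tokens = [token.text for token in first]
```

Iterating over the span gives the matched tokens, but it does not say which part of the pattern each token came from, so it only partly covers matches[n].tokens as imagined here.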