Feature Request: Pass custom values from Matcher pattern definitions to matched tokens

Open apodgorny opened this issue 1 month ago • 0 comments

Discussed in https://github.com/explosion/spaCy/discussions/13519

Originally posted by apodgorny June 5, 2024

Consider a case where I need to tag FAX and TEL separately:

Tel: 24234-3433-3322 Fax: 24234-3433-3323

I currently have two options for NER with Matcher:

  1. Match [{'LOWER': 'tel'}, {'ORTH': ':'}, {PATTERN_TO_MATCH_PHONE}]
  2. Match [{PATTERN_TO_MATCH_PHONE}]

Neither option accomplishes the goal:

  1. The match includes unnecessary extra tokens (which may be needed for additional unrelated tagging – I have a case to show as well)
  2. The match does not distinguish between FAX and TEL

SOLUTION:

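from spacy.tokens import Token

# Register the custom attribute that the pattern would set on matched tokens.
Token.set_extension('exclude', default=False, force=True)
# Proposed syntax: values under '_' get written to the tokens that match.
patterns = [
    {'LOWER': 'tel', '_': {'exclude': True}},
    {'ORTH': ':', '_': {'exclude': True}},
    {PATTERN_TO_MATCH_PHONE}
]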

These custom values should be written to the matched tokens by the call matches = matcher(doc), so that tokens can be told apart by the pattern element that matched them, for example by checking doc[n]._.exclude == True.
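For comparison, here is a rough sketch of how the same effect can be approximated today with an on_match callback. It is only an illustration under assumptions: the regex stands in for PATTERN_TO_MATCH_PHONE, the names mark_context and N_CONTEXT are made up for this example, and the Doc is built from pre-split words so the result does not depend on the tokenizer.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Token

Token.set_extension("exclude", default=False, force=True)

nlp = spacy.blank("en")
# Pre-tokenized on purpose, so the sketch does not rely on tokenizer rules.
words = ["Tel", ":", "24234-3433-3322", "Fax", ":", "24234-3433-3323"]
doc = Doc(nlp.vocab, words=words)

# Hypothetical stand-in for PATTERN_TO_MATCH_PHONE.
PHONE = {"TEXT": {"REGEX": r"^\d+(-\d+)+$"}}
# Assumption: every pattern starts with exactly two context tokens.
N_CONTEXT = 2


def mark_context(matcher, doc, i, matches):
    """Flag the leading context tokens of match i as excluded."""
    match_id, start, end = matches[i]
    for token in doc[start:start + N_CONTEXT]:
        token._.exclude = True


matcher = Matcher(nlp.vocab)
matcher.add("TEL", [[{"LOWER": "tel"}, {"ORTH": ":"}, PHONE]], on_match=mark_context)
matcher.add("FAX", [[{"LOWER": "fax"}, {"ORTH": ":"}, PHONE]], on_match=mark_context)

results = []
for match_id, start, end in sorted(matcher(doc), key=lambda m: m[1]):
    # Keep only the tokens the callback did not flag.
    kept = [t.text for t in doc[start:end] if not t._.exclude]
    results.append((nlp.vocab.strings[match_id], kept))
```

This works, but the number of context tokens has to be kept in sync with each pattern by hand, which is exactly the bookkeeping that per-token values in the pattern would remove.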

This would cover multiple cases that were previously hard or impossible to solve with the spaCy Matcher:

  1. Matching by preceding tokens
  2. Matching by following tokens
  3. Matching a complex pattern of tokens that appear in a constellation, in order to tag them separately
  4. Cascading matches, where you tag items and then match again relying on the previously tagged entities without overwriting them
  5. Other potential cases that I did not think of, but that others could invent, which would benefit from the possibility of passing data this way

Thank you for an awesome library – this addition would make it awesome-awesome :)

P.S. Extra credit :) If we could do matches[n].tokens it would be triple awesome
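As far as I can tell, the closest existing thing is calling the matcher with as_spans=True, which returns Span objects instead of (match_id, start, end) tuples, so the tokens of a match can be iterated directly. A minimal sketch, again with a regex standing in for PATTERN_TO_MATCH_PHONE and a pre-split Doc:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["Fax", ":", "24234-3433-3323"])

matcher = Matcher(nlp.vocab)
# Hypothetical stand-in for PATTERN_TO_MATCH_PHONE as the last element.
matcher.add("FAX", [[{"LOWER": "fax"}, {"ORTH": ":"}, {"TEXT": {"REGEX": r"^\d+(-\d+)+$"}}]])

# as_spans=True yields Span objects labeled with the pattern key.
spans = matcher(doc, as_spans=True)
first = spans[0]
token_texts = [token.text for token in first]
```

That gives the tokens of each match, but not which pattern element each token matched, so it does not replace the request above.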

Posted by apodgorny, Jun 05 '24 17:06