Get original matched pattern back

Open Lostincodes opened this issue 3 years ago • 4 comments

Hi, a very useful feature would be for SpaczzRuler to return the original pattern that produced a match. When similar patterns are added, it can be unclear which one was actually matched. I guess this issue also connects to a potential link with spaCy knowledge base IDs. Thank you

Lostincodes avatar Mar 16 '21 11:03 Lostincodes

Hi @Lostincodes, I've been in a busy stretch at work so thanks for your patience.

So this currently is not a part of spaczz because, as far as I know, it isn't part of spaCy. That being said, it isn't as important for spaCy because matches from spaCy components generally match their patterns exactly, which obviously is not the case for spaczz.

I've gotten this request a couple times now so I need to think about a way to integrate this functionality without moving spaczz's API too far away from spaCy's.

In the short term there is a way to do this already using ent_ids in the SpaczzRuler. If you repeat the pattern in the optional id field this will be assigned to the ent_id_ attribute. See the example below:

import spacy
from spaczz.pipeline import SpaczzRuler  # makes the "spaczz_ruler" factory available

nlp = spacy.blank("en")
spaczz_ruler = nlp.add_pipe("spaczz_ruler")  # spaCy v3 syntax
# Repeat each pattern in the optional "id" field so it is exposed as ent.ent_id_.
spaczz_ruler.add_patterns(
    [
        {"label": "COUNTRY", "pattern": "Ireland", "type": "fuzzy", "id": "Ireland"},
        {"label": "COUNTRY", "pattern": "Iceland", "type": "fuzzy", "id": "Iceland"},
    ]
)

doc = nlp("This is a test that should find Iceland")
print([(f"Pattern: {ent.ent_id_}", f"Match: {ent.text}") for ent in doc.ents])
# Output:
# [('Pattern: Iceland', 'Match: Iceland')]

The caveats to this workaround are:

  1. It's a little hacky.
  2. It means you can't use the id field for other purposes.
  3. Token patterns need to be written as strings, e.g. '[{"TEXT": {"FUZZY": "Iceland"}}]'.
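
If you have many patterns, a small helper can copy each pattern into its id field for you. The sketch below is only an illustration: `with_pattern_ids` is a hypothetical name, not part of spaczz, and serializing token patterns with `json.dumps` is just one way to satisfy caveat 3.

```python
import json


def with_pattern_ids(patterns):
    """Return copies of the patterns with each "id" set to the pattern itself.

    Hypothetical helper, not part of spaczz. String patterns are used as-is;
    token patterns (lists of dicts) are serialized to a JSON string, since
    the id field has to be a string.
    """
    tagged = []
    for entry in patterns:
        pattern = entry["pattern"]
        pattern_id = pattern if isinstance(pattern, str) else json.dumps(pattern)
        # Any existing "id" is overwritten, which is caveat 2 in practice.
        tagged.append({**entry, "id": pattern_id})
    return tagged


patterns = with_pattern_ids(
    [
        {"label": "COUNTRY", "pattern": "Ireland", "type": "fuzzy"},
        {"label": "COUNTRY", "pattern": [{"TEXT": {"FUZZY": "Iceland"}}], "type": "token"},
    ]
)
```

The resulting list can be passed to spaczz_ruler.add_patterns(patterns) as in the example above. For a token pattern, json.loads(ent.ent_id_) should give back the original list.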

Hopefully the above workaround suffices for the time being. I will see if I can think of a simpler way to integrate this feature.

gandersen101 avatar Mar 20 '21 01:03 gandersen101

This solution behaves a little strangely depending on whether you use a blank spaCy model or one of the default English ones. When applied to a blank en model the solution works fine, but when applied to en_core_web_sm it sometimes gives empty strings in the ID field. Is there a way to address this?

wTaylorBickelmann avatar Aug 12 '21 19:08 wTaylorBickelmann

Hi @wTaylorBickelmann. I believe what is happening is that when you have the en_core_web_sm model in the pipeline and then add the SpaczzRuler, the ruler is added to the end of the pipeline, after the model's NER component. The NER tags "Iceland" as a "GPE", so by the time the doc gets to the SpaczzRuler that span has already been tagged and the ruler skips over it. If you put the ruler before NER in the pipeline you should get the expected results.

gandersen101 avatar Aug 15 '21 21:08 gandersen101

Closed by #81. spaczz v0.6 now returns original patterns.

gandersen101 avatar May 01 '23 12:05 gandersen101