spaczz
Get original matched pattern back
Hi, a very useful feature would be to get back the original pattern matched by SpaczzRuler, because when similar patterns are added it can be unclear which one produced the match. I guess this issue connects to a potential link with spaCy knowledge base IDs. Thank you
Hi @Lostincodes, I've been in a busy stretch at work so thanks for your patience.
So this currently is not a part of spaczz because it isn't part of spaCy as far as I know. That being said it isn't as important for spaCy because generally the matches with spaCy components exactly match their patterns, which obviously is not the case for spaczz.
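To illustrate why the matched text alone does not identify the pattern under fuzzy matching, here is a minimal sketch using the standard library's difflib as a stand-in scorer. It is not how spaczz scores matches; the helper name candidate_patterns and the 0.75 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def candidate_patterns(
    text: str, patterns: List[str], threshold: float = 0.75
) -> List[Tuple[str, float]]:
    """Return every pattern whose similarity to `text` meets the threshold.

    Hypothetical helper: difflib's ratio stands in for a fuzzy scorer,
    and the threshold value is an arbitrary choice for illustration.
    """
    scored = [
        (pattern, SequenceMatcher(None, pattern, text).ratio())
        for pattern in patterns
    ]
    # Keep only patterns that clear the threshold, best score first.
    kept = [(p, r) for p, r in scored if r >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # A misspelled match is close to both patterns.
    for pattern, ratio in candidate_patterns("Irland", ["Ireland", "Iceland"]):
        print(f"{pattern}: {ratio:.3f}")
```

With this scorer both "Ireland" and "Iceland" clear the threshold for the text "Irland", so seeing only the matched text leaves it open which pattern was responsible.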
I've gotten this request a couple times now so I need to think about a way to integrate this functionality without moving spaczz's API too far away from spaCy's.
In the short term there is a way to do this already using ent_ids in the SpaczzRuler. If you repeat the pattern in the optional id field, this will be assigned to the ent_id_ attribute. See the example below:
import spacy
from spaczz.pipeline import SpaczzRuler  # makes the "spaczz_ruler" factory available

nlp = spacy.blank("en")
spaczz_ruler = nlp.add_pipe("spaczz_ruler")  # spaCy v3 syntax
spaczz_ruler.add_patterns(
    [
        # Repeat each pattern in "id" so it comes back on ent.ent_id_
        {"label": "COUNTRY", "pattern": "Ireland", "type": "fuzzy", "id": "Ireland"},
        {"label": "COUNTRY", "pattern": "Iceland", "type": "fuzzy", "id": "Iceland"},
    ]
)
doc = nlp("This is a test that should find Iceland")
print([(f"Pattern: {ent.ent_id_}", f"Match: {ent.text}") for ent in doc.ents])
[('Pattern: Iceland', 'Match: Iceland')]
The caveats to this workaround are:
- It's a little hacky.
- It means you can't use the id field for other purposes.
- Token patterns need to be written as strings, e.g. "[{"TEXT": {"FUZZY": "Iceland"}}]".
Hopefully the above work-around suffices for you for the time being. I will see if I can think of a simpler way to integrate this feature.
This solution behaves a little strangely depending on whether you use a blank spaCy model or one of the default English ones. When applying the solution to a blank en model it works fine, but when applying it to en_core_web_sm it sometimes gives empty strings under the ID field. Is there a way to address this?
![Screen Shot 2021-08-12 at 3 03 36 PM](https://user-images.githubusercontent.com/57469687/129254494-528d3b1a-41d2-4556-838f-468905d6a45e.png)
Hi @wTaylorBickelmann. I believe what is happening is that when you have the en_core_web_sm model in the pipeline and then add the SpaczzRuler, the ruler is added to the end of the pipeline, after the model's NER component. The NER tags "Iceland" as a "GPE", so by the time the doc gets to the SpaczzRuler the span has already been tagged and the ruler skips over it. If you put the ruler before NER in the pipeline you should get the expected results.
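The ordering effect described above can be sketched with a toy model in plain Python. This is not spaCy or spaczz code: toy_ner, toy_ruler, and run are hypothetical stand-ins, and the only behavior modeled is that a component leaves already-tagged tokens alone. In spaCy v3 itself, the position of a component is normally set with the before or after arguments of nlp.add_pipe, for example nlp.add_pipe("spaczz_ruler", before="ner").

```python
from typing import Callable, Dict, List, Tuple

# token index -> (label, ent_id); a toy stand-in for doc.ents
Ents = Dict[int, Tuple[str, str]]
Component = Callable[[List[str], Ents], None]


def toy_ner(tokens: List[str], ents: Ents) -> None:
    """Hypothetical statistical NER: tags 'Iceland' as GPE with no ent_id."""
    for i, tok in enumerate(tokens):
        if tok == "Iceland" and i not in ents:
            ents[i] = ("GPE", "")


def toy_ruler(tokens: List[str], ents: Ents) -> None:
    """Hypothetical ruler: tags 'Iceland' as COUNTRY and records the pattern id,
    but skips tokens an earlier component has already tagged."""
    for i, tok in enumerate(tokens):
        if tok == "Iceland" and i not in ents:
            ents[i] = ("COUNTRY", "Iceland")


def run(pipeline: List[Component], text: str) -> List[Tuple[str, str, str]]:
    """Apply the components in order and return (text, label, ent_id) triples."""
    tokens = text.split()
    ents: Ents = {}
    for component in pipeline:
        component(tokens, ents)
    return [(tokens[i], label, ent_id) for i, (label, ent_id) in sorted(ents.items())]


if __name__ == "__main__":
    text = "This is a test that should find Iceland"
    print(run([toy_ner, toy_ruler], text))  # ruler last: the NER tag wins, ent_id is empty
    print(run([toy_ruler, toy_ner], text))  # ruler first: the pattern id is kept
```

In the toy model the empty ID only appears when the ruler runs after the NER stand-in, which mirrors the empty strings reported with en_core_web_sm.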
Closed by #81. spaczz v0.6 now returns original patterns.