holmes-extractor
Guidance requested on using doc.retokenize
I'm having a great time with this library, but running into the following issue:
As mentioned in https://github.com/explosion/holmes-extractor/issues/2#issuecomment-1195536077, I am manually setting entities using doc.set_ents. I found that some obvious / verbatim search phrases (e.g. something like ENTITYCUSTOM_ENT1 something ENTITYCUSTOM_ENT2) would not match properly until I made sure to retokenize the doc before registering it with the manager:
from spacy.util import filter_spans

docs = []
# Drop overlapping spans, keeping the longest ones
spans = filter_spans(spans=spans)
doc.set_ents(spans, default='unmodified')
# Merge each entity span into a single token
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span, {'POS': 'NOUN'})
docs.append(doc)
# ...
for doc in docs:
    manager.register_serialized_document(doc.to_bytes(), label=label)
However, this also appears to have created a misalignment in doc length, and I am not sure how. Some search phrases end up throwing this error when a match is found:
IndexError: [E026] Error accessing token at position 145: out of bounds in Doc of length 88.
I was able to debug that particular doc, and found that my retokenized doc has 88 (word) tokens. If I run the doc.text through manager.nlp, however, the resulting doc has 145 word tokens. This suggests to me that somewhere the merged span information is lost, perhaps after some internal call to manager.nlp?
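To make the length mismatch easier to reproduce, here is a minimal sketch that uses only spaCy and leaves Holmes out. It assumes a blank English pipeline (`spacy.blank("en")`) instead of `manager.nlp`, and the sentence and entity labels are made up for illustration. It shows that a merged doc keeps its token count through `to_bytes()` / `from_bytes()`, while running the pipeline again on `doc.text` goes back to the unmerged tokenization. It does not show where, or whether, Holmes re-parses the text internally.

```python
import spacy
from spacy.tokens import Doc, Span
from spacy.util import filter_spans

# Assumption: a blank English pipeline stands in for manager.nlp.
nlp = spacy.blank("en")

# Hypothetical text and labels, for illustration only.
doc = nlp("Acme Corp acquired Widget Works Ltd last year")
spans = filter_spans([
    Span(doc, 0, 2, label="CUSTOM_ENT1"),  # "Acme Corp"
    Span(doc, 3, 6, label="CUSTOM_ENT2"),  # "Widget Works Ltd"
])
doc.set_ents(spans, default="unmodified")

# Merge each entity span into a single token.
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span, attrs={"POS": "NOUN"})

# Serialization round trip: the merged tokenization is preserved.
restored = Doc(nlp.vocab).from_bytes(doc.to_bytes())

# Re-running the pipeline on the raw text: the merges are gone.
reparsed = nlp(doc.text)

print(len(doc), len(restored), len(reparsed))
```

If something downstream computes token indices on a re-parsed copy of the text and then applies them to the merged doc, the indices can exceed the merged doc's length, which would be consistent with the E026 error above.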