
Guidance requested on using doc.retokenize

Open adelevie opened this issue 2 years ago • 0 comments

I'm having a great time with this library, but I am running into the following issue:

As mentioned in https://github.com/explosion/holmes-extractor/issues/2#issuecomment-1195536077, I am manually setting entities using doc.set_ents. I found that some obvious, verbatim search phrases (e.g. something like ENTITYCUSTOM_ENT1 something ENTITYCUSTOM_ENT2) would not match properly until I retokenized the doc, merging each entity span into a single token, before registering it with the manager:

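from spacy.util import filter_spans

docs = []
# Drop overlapping spans; where spans overlap, the longest one is kept
spans = filter_spans(spans=spans)
doc.set_ents(spans, default='unmodified')
# Merge each entity span into a single token tagged as a noun
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span, {'POS': 'NOUN'})
docs.append(doc)
# ...
for doc in docs:
    manager.register_serialized_document(doc.to_bytes(), label=label)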

However, this also appears to have created a mismatch in doc length, and I am not sure how. Some search phrases throw this error when a match is found:

IndexError: [E026] Error accessing token at position 145: out of bounds in Doc of length 88.

I was able to debug that particular doc, and found that my retokenized doc has 88 tokens. If I run doc.text through manager.nlp, however, the resulting doc has 145 tokens. This suggests to me that the merged span information is lost somewhere, perhaps after some internal call to manager.nlp?
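A minimal sketch of the length mismatch described above, using plain spaCy only. It assumes a blank English pipeline in place of manager.nlp, and the sentence, the span boundaries and the labels CUSTOM_ENT1 / CUSTOM_ENT2 are made up for illustration. It does not show what Holmes does internally; it only shows that a merged doc and a fresh parse of the same text disagree on token count, while a to_bytes / from_bytes round trip keeps the merged tokens.

```python
import spacy
from spacy.tokens import Doc, Span
from spacy.util import filter_spans

# Assumption: a blank English pipeline stands in for manager.nlp
nlp = spacy.blank("en")

text = "New York City is bigger than San Francisco Bay today"
merged_doc = nlp(text)

# Hypothetical custom entities covering tokens 0-2 and 6-8
spans = filter_spans([
    Span(merged_doc, 0, 3, label="CUSTOM_ENT1"),
    Span(merged_doc, 6, 9, label="CUSTOM_ENT2"),
])
merged_doc.set_ents(spans, default="unmodified")

# Merge each entity span into a single token, as in the snippet above
with merged_doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span, attrs={"POS": "NOUN"})

# Parsing the same text again knows nothing about the merges
reparsed_doc = nlp(merged_doc.text)

# A serialization round trip keeps the merged tokenization
restored_doc = Doc(nlp.vocab).from_bytes(merged_doc.to_bytes())

print(len(merged_doc), len(reparsed_doc), len(restored_doc))
```

If anything downstream rebuilt the document from its text (an assumption, not something confirmed here), token indices taken from the longer parse could point past the end of the shorter merged doc, which is the shape of the E026 error above.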

adelevie, Jul 27 '22 16:07