Is there any way to use "doc.spans" in 01_parse.py?

Open nonstoprunning opened this issue 4 years ago • 0 comments

Hi, I am trying to build a sense2vec model with new data. I have made a few changes in 01_parse.py. First, I removed the default ner pipe that comes with "en_core_web_lg". Then I added a new Language.component in which I identify Spans associated with new entities (new labels) in a doc. Sometimes I would like to assign a Span[x, y] to more than one entity, but I cannot. My question: I have read about the new changes in spaCy v3.1. Is there a way to use "doc.spans" (or something similar) in 01_parse.py so that spaCy's internal algorithms take Span overlaps into account?

```python
from spacy.language import Language
from spacy.tokens import Span

# `matcher` is a spaCy Matcher created elsewhere in the script.


@Language.component("name_comp")
def my_component(doc):
    matches = matcher(doc)
    seen_tokens = set()
    new_entities = []
    entities = doc.ents
    for match_id, start, end in matches:
        # check end - 1 because `end` is exclusive: end - 1 is the last token of the match
        if start not in seen_tokens and end - 1 not in seen_tokens:
            new_entities.append(Span(doc, start, end, label=match_id))
            # drop existing entities that overlap the new span
            entities = [
                e for e in entities if not (e.start < end and e.end > start)
            ]
            seen_tokens.update(range(start, end))
    doc.ents = tuple(entities) + tuple(new_entities)
    return doc
```
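To make the question concrete, here is a minimal sketch of what storing overlapping spans in `doc.spans` could look like, assuming spaCy v3.1 or later. The component name `overlap_comp`, the group key `"sc"`, the labels, and the patterns are all hypothetical and chosen only for illustration. It shows only how overlapping spans can be stored; it does not show that 01_parse.py or sense2vec would read them.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Span

# Blank English pipeline: no model download and no default ner pipe.
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
# Hypothetical patterns: ORG and FRUIT match the same token,
# and PRODUCT overlaps both of them.
matcher.add("ORG", [[{"LOWER": "apple"}]])
matcher.add("FRUIT", [[{"LOWER": "apple"}]])
matcher.add("PRODUCT", [[{"LOWER": "apple"}, {"LOWER": "pie"}]])


@Language.component("overlap_comp")  # hypothetical component name
def overlap_component(doc):
    # Unlike doc.ents, a span group accepts overlapping and duplicate spans,
    # so no filtering with seen_tokens is needed here.
    doc.spans["sc"] = [  # "sc" is an arbitrary key chosen for this sketch
        Span(doc, start, end, label=match_id)
        for match_id, start, end in matcher(doc)
    ]
    return doc


nlp.add_pipe("overlap_comp")
doc = nlp("I like apple pie")

# Collect (text, label) pairs in a deterministic order for inspection.
labels = sorted((span.text, span.label_) for span in doc.spans["sc"])
print(labels)
```

In this sketch `doc.ents` is left untouched, so anything downstream that only looks at `doc.ents` would not see these spans.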

Thanks in advance, Paula

nonstoprunning, Aug 04 '21 15:08