
FIXING span matching warning error

Open felipemello1 opened this issue 3 years ago • 0 comments

In medspacy, if a token matches your rule but the match does not line up with an entity already present in the document, medspacy fails to update the document with the rule result, and your pipeline fails. For example, if the document has the entity “UNITED STATES” and you try to update it with labels for “UNITED” and “STATES” separately, the entity “UNITED STATES” will not be updated by your rule. This can be fixed by improving your rule with "OP": "+", but you won’t fix it if you don’t know that it is broken in the first place.

In medspacy 1.0.0, a failed update raises an error (check the code) and makes the whole pipeline fail. In medspacy 0.2.0, the error is ignored and the label is SILENTLY not registered (check the code).

Code in medspacy 0.2.0:

    if self.add_ents is True:
        for span in spans:
            try:
                doc.ents += (span,)
            # spaCy will raise a value error if the token in span are already
            # part of an entity (ie., as part of an upstream component
            # In that case, let the existing span supercede this one
            except ValueError as e:
                # raise e
                pass
        return doc
    else:
        return spans

A simple solution is to set add_ents = False on the matcher and update the document outside the matcher call. Then, if the try block fails, instead of ignoring the exception or raising it, we can just log it without stopping the whole pipeline.

    import logging

    from spacy.tokens import Doc

    logger = logging.getLogger(__name__)

    def extract(self, doc: Doc):
        # Ask the matcher to return the spans instead of writing them to doc.ents
        self.matcher.add_ents = False
        spans = self.matcher(doc)

        for span in spans:
            try:
                doc.ents += (span,)
            # spaCy raises a ValueError if the tokens in the span are already
            # part of an entity (i.e., set by an upstream component).
            # In that case, keep the existing entity and log the conflict.
            except ValueError as e:
                logger.warning("Could not add span %r: %s", span.text, e)

        # Returns the merged entities:
        # already-found entities (via NER)
        # plus new entities created from the patterns
        return doc.ents
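
To show the difference between raising, silently dropping, and logging without needing spaCy or medspacy installed, here is a minimal standalone sketch. Everything in it is hypothetical and only for illustration: `SimpleSpan` is a stand-in for a spaCy span (token offsets plus a label), and `overlaps` and `merge_spans` are names I made up, not medspacy or spaCy functions. The overlap check only approximates the condition under which spaCy refuses to set `doc.ents`.

```python
import logging
from typing import List, NamedTuple, Tuple

logger = logging.getLogger("span_merge_sketch")


class SimpleSpan(NamedTuple):
    """Hypothetical stand-in for a spaCy span: token range [start, end) and a label."""
    start: int
    end: int
    label: str


def overlaps(a: SimpleSpan, b: SimpleSpan) -> bool:
    # Two half-open token ranges overlap when each starts before the other ends.
    return a.start < b.end and b.start < a.end


def merge_spans(
    existing: List[SimpleSpan], candidates: List[SimpleSpan]
) -> Tuple[List[SimpleSpan], List[SimpleSpan]]:
    """Add non-conflicting candidates; log and collect the conflicting ones."""
    merged = list(existing)
    skipped = []
    for cand in candidates:
        clash = next((ent for ent in merged if overlaps(ent, cand)), None)
        if clash is None:
            merged.append(cand)
        else:
            # Neither raise (pipeline stops) nor pass (silent loss): report it.
            logger.warning("Skipping %s: overlaps existing entity %s", cand, clash)
            skipped.append(cand)
    return merged, skipped


existing = [SimpleSpan(0, 2, "GPE")]  # "UNITED STATES", e.g. from an upstream NER
candidates = [
    SimpleSpan(0, 1, "RULE"),  # "UNITED"
    SimpleSpan(1, 2, "RULE"),  # "STATES"
    SimpleSpan(3, 4, "RULE"),  # an unrelated token elsewhere in the text
]
merged, skipped = merge_spans(existing, candidates)
```

Returning the skipped spans alongside the merged ones means the caller can count or inspect the conflicts later, which is how you find out that a rule needs to be improved.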
    

Related issues: https://github.com/medspacy/medspacy/issues/182

felipemello1, Dec 20 '22 18:12