AttributeError with NlpEngine

Open matthewchung74 opened this issue 3 years ago • 2 comments

Describe the bug: When running certain strings through the new TransformersNlpEngine, I get an AttributeError. It doesn't happen on every string.

To reproduce:

  1. See the provided Colab notebook: https://colab.research.google.com/drive/1H_kKeHlfvZUSaPN0HNRn1_ymZk_TH4Cp#scrollTo=J63RtBZFaiLv

Expected behavior: I would expect it not to raise an error on any string.

Additional context: I hacked this line, https://github.com/microsoft/presidio/blob/37f74e8e880cb1bdf3f5224a05eaa9b63df02d31/presidio-analyzer/presidio_analyzer/nlp_engine/transformers_nlp_engine.py#L59, by wrapping it in a None check as follows:

            if span is not None:
                span._.confidence_score = d["score"]
                ents.append(span)

and it silences the issue, but I don't understand why. Help is appreciated.

matthewchung74 avatar Sep 01 '22 20:09 matthewchung74

Good catch. According to spaCy's docs, char_span returns None if the character indices don't map to a valid span under the default alignment mode, strict. That explains why the None check silences the error: without it, the code sets an attribute on None. We'll look into this. Perhaps changing the alignment_mode is the preferred solution.
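To illustrate the alignment behavior described above, here is a minimal, standard-library-only sketch. It is not spaCy's implementation: the whitespace tokenizer, the helper names tokenize and align, and the sample prediction are all assumptions made for illustration. It models the three documented alignment modes (strict, contract, expand) for a predicted entity whose character offsets start in the middle of a token.

```python
import re
from typing import List, Optional, Tuple

Offsets = Tuple[int, int]


def tokenize(text: str) -> List[Offsets]:
    """Toy whitespace tokenizer (an assumption, not spaCy's tokenizer).

    Returns (start, end) character offsets for each token.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def align(tokens: List[Offsets], start: int, end: int,
          mode: str = "strict") -> Optional[Offsets]:
    """Map character offsets to a (first_token, end_token_exclusive) pair.

    Simplified model of the documented alignment modes:
      strict   - both offsets must fall exactly on token boundaries
      contract - keep only tokens fully inside [start, end)
      expand   - keep every token that overlaps [start, end)
    Returns None when no valid token span exists.
    """
    if mode == "strict":
        starts = {s: i for i, (s, _) in enumerate(tokens)}
        ends = {e: i for i, (_, e) in enumerate(tokens)}
        if start in starts and end in ends and starts[start] <= ends[end]:
            return starts[start], ends[end] + 1
        return None
    if mode == "contract":
        hits = [i for i, (s, e) in enumerate(tokens) if s >= start and e <= end]
    elif mode == "expand":
        hits = [i for i, (s, e) in enumerate(tokens) if s < end and e > start]
    else:
        raise ValueError(f"unknown alignment mode: {mode}")
    return (hits[0], hits[-1] + 1) if hits else None


text = "Call Dr.Smith today"
tokens = tokenize(text)  # "Dr.Smith" is a single token covering chars 5..13

# Made-up model output: the entity "Smith" starts mid-token at char 8.
prediction = {"start": 8, "end": 13, "score": 0.99}

for mode in ("strict", "contract", "expand"):
    span = align(tokens, prediction["start"], prediction["end"], mode)
    # Guard against None before using the span, as in the workaround above.
    if span is not None:
        print(mode, "->", span)
    else:
        print(mode, "-> None (prediction skipped)")
# strict -> None (prediction skipped)
# contract -> None (prediction skipped)
# expand -> (1, 2)
```

In this toy model, only expand recovers a span, and it widens the entity to the whole token "Dr.Smith". Contract can still yield None when no token lies fully inside the predicted range, so a None check remains useful whichever mode is chosen.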

omri374 avatar Sep 04 '22 06:09 omri374

Thanks for the response, @omri374. I tried both expand and contract for alignment_mode and am seeing exceptions thrown in both cases. I do feel expand is the way to go, but it might need some tweaking. I can try to do some more digging later this week.

matthewchung74 avatar Sep 05 '22 04:09 matthewchung74

Fixed in #941

SharonHart avatar Dec 19 '22 10:12 SharonHart

Note that there could still be issues with the alignment, and #941 is not a perfect solution. If you run into issues with this, we recommend using the TransformersRecognizer instead of the TransformersNlpEngine.

omri374 avatar Dec 19 '22 10:12 omri374