Same Analyzer detects entity in text but not in image

Open NuiMrme opened this issue 1 year ago • 4 comments

Describe the bug

The same analyzer detects a LOCATION entity token in text but fails to detect the same token in an image.

To Reproduce

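from presidio_analyzer import AnalyzerEngine
from presidio_image_redactor import ImageAnalyzerEngine, ImageRedactorEngine

# nlp_engine_with_french is assumed to be an NLP engine configured elsewhere for "fr" and "en".
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_french,
    log_decision_process=True,  # boolean rather than the string "true"
    supported_languages=["fr", "en"],
)

print(analyzer.analyze(text="VALENCE", language="en"))

image_analyzer = ImageAnalyzerEngine(analyzer_engine=analyzer)
engine = ImageRedactorEngine(image_analyzer_engine=image_analyzer)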

Expected behavior

In plain text, VALENCE is detected as LOCATION, even if I change the language, the surrounding text, or the casing (lower-case, etc.). If I use the same analyzer to create an ImageAnalyzerEngine, VALENCE should also be detected as LOCATION when the word appears in the image.

NuiMrme avatar Feb 28 '24 10:02 NuiMrme

Could it be that the OCR engine doesn't recognize this text? Have you tried running Tesseract on the image to see the output?

omri374 avatar Feb 28 '24 14:02 omri374
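
One way to run that check is sketched below. It assumes `pytesseract` and Pillow are installed and the Tesseract binary is on the PATH; the helper names `ocr_words` and `find_token` and the file name in the usage note are hypothetical, not part of Presidio.

```python
def ocr_words(image_path):
    """Return (word, confidence) pairs that Tesseract reads from an image."""
    # Imported lazily so find_token below can be used without OCR installed.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    # Skip empty entries, which Tesseract emits for layout blocks.
    return [
        (word, float(conf))
        for word, conf in zip(data["text"], data["conf"])
        if word.strip()
    ]


def find_token(words, token):
    """Case-insensitive search for a token among OCR'd (word, confidence) pairs."""
    return [(w, c) for w, c in words if token.lower() in w.lower()]
```

For example, `find_token(ocr_words("page.png"), "VALENCE")` would list every OCR'd word containing the token. An empty result points at OCR; a non-empty one points at the entity recognition step instead.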

Since I have log_decision_process enabled, I can see that the word "VALENCE" does appear in the log.

Edit: That being said, I created an empty image containing only the word "VALENCE", and it was detected as LOCATION. Does the detection depend on the words before and after?

NuiMrme avatar Feb 28 '24 14:02 NuiMrme

Yes, location is detected using a named entity recognition model, so context words could certainly change the output. If you have a finite list of locations, you can create a deny list and pass it to the analyzer engine.

omri374 avatar Feb 29 '24 15:02 omri374

Thanks for your reply. I actually have that in my code already, but this one location is still not detected. I think the problem is deeper: something about that exact document makes it problematic. In my deny list, though, the locations are written with only the first letter capitalized, while in the document the word is written entirely in capital letters. I'm not sure whether this is a problem.

NuiMrme avatar Feb 29 '24 16:02 NuiMrme
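
If casing is the suspect, one way to rule it out is to register the deny list with explicit case variants. This is a minimal sketch, not a confirmed fix: whether deny-list matching is case-sensitive may depend on the installed Presidio version and its regex flags, which is not verified here. The helper names `expand_case_variants` and `build_location_recognizer` are hypothetical; `PatternRecognizer` with `supported_entity` and `deny_list` is the Presidio class assumed to be available.

```python
def expand_case_variants(terms):
    """Return each term plus its lower, UPPER and Title forms, without duplicates."""
    seen = set()
    variants = []
    for term in terms:
        for variant in (term, term.lower(), term.upper(), term.title()):
            if variant not in seen:
                seen.add(variant)
                variants.append(variant)
    return variants


def build_location_recognizer(locations):
    """Build a deny-list recognizer that covers the common casings of each location."""
    # Imported lazily so expand_case_variants works without Presidio installed.
    from presidio_analyzer import PatternRecognizer

    return PatternRecognizer(
        supported_entity="LOCATION",
        deny_list=expand_case_variants(locations),
    )
```

The recognizer can then be added with `analyzer.registry.add_recognizer(build_location_recognizer(["Valence"]))` before the analyzer is handed to the image analyzer. If "VALENCE" is still missed in the image after that, casing is unlikely to be the cause.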