presidio Enhancing the decision process text when working with images

Is your feature request related to a problem? Please describe.

The decision process output prints the entity_type, start_position, end_position and the score. When working with longer texts or with images, output such as start = 204 end = 217 is not meaningful on its own, and it is hard to see which text it refers to.

Describe the solution you'd like

Add an entity_text field so that the text in question is also printed: start = 204 end = 217 entity_text = "Saint Antonio"

I solved this in my version by adding

entity_text: str,

to the __init__ function in recognizer_result.py. This also affected image_analyzer_engine.py, image_recognizer_results.py, spacy_recognizer.py and pattern_recognizer.py, but the output is much more readable.
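A minimal sketch of the idea, assuming a simplified stand-in class rather than Presidio's actual RecognizerResult. The names SketchResult and with_entity_text are hypothetical, and the entity type and score are made-up values; only the start/end positions and the "Saint Antonio" text come from the description above.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SketchResult:
    """Hypothetical stand-in for a recognizer result (not Presidio's class)."""

    entity_type: str
    start: int
    end: int
    score: float
    # Assumed new field: the detected text itself, absent unless filled in.
    entity_text: Optional[str] = None

    def to_dict(self) -> dict:
        return asdict(self)


def with_entity_text(result: SketchResult, text: str) -> SketchResult:
    """Return a copy of the result carrying the substring text[start:end]."""
    return SketchResult(
        entity_type=result.entity_type,
        start=result.start,
        end=result.end,
        score=result.score,
        entity_text=text[result.start:result.end],
    )


# Build a long text so that the entity sits at positions 204-217.
document = "x" * 204 + "Saint Antonio" + " and more text"
result = with_entity_text(SketchResult("PERSON", 204, 217, 0.85), document)
print(result.to_dict())
```

With the text attached, a reader of the log no longer has to count characters to find out what positions 204 to 217 refer to.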

Apr 22 '24 08:04 NuiMrme

While at it, I modified line 222 of analyzer_engine.py so that each result is printed on a new line, which is even more readable:

json.dumps([str(result.to_dict()) for result in results], indent=2),
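To illustrate the effect of indent=2, here is a small self-contained sketch. The result dictionaries are made-up stand-ins for what result.to_dict() might return, and format_results is a hypothetical helper, not a function in Presidio.

```python
import json

# Illustrative stand-ins for result.to_dict() output (values are made up).
results = [
    {"entity_type": "PERSON", "start": 0, "end": 5, "score": 0.85},
    {"entity_type": "LOCATION", "start": 204, "end": 217, "score": 0.85},
]


def format_results(results: list) -> str:
    """Serialize each result as a string, one list item per line."""
    return json.dumps([str(r) for r in results], indent=2)


formatted = format_results(results)
print(formatted)
```

Because each result is converted with str() first, every detection stays on a single line of the log. Passing the dictionaries directly would instead spread each key over its own line.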

Apr 23 '24 13:04 NuiMrme

@NuiMrme are you asking specifically for images, or for any text?

Apr 23 '24 13:04 omri374

Does this help? https://github.com/microsoft/presidio/discussions/925#discussioncomment-3781214

Apr 23 '24 13:04 omri374

Sorry, that wasn't well explained. I'm not reporting a bug but rather describing a feature I implemented in my version of Presidio that might help others too. When you work with images or a lot of text with log_decision_process=True, an entry is logged for every detection, and the log becomes unreadable. Note that this is printed automatically; no explicit print command is used, unlike in the comment you shared above. With a one-line example that's fine: I can quickly see what the positions refer to. But when many of these are stacked together because they come from an image of a document, you no longer know what is what. So I made the modifications mentioned above to make the output more readable.

Every new case now begins on a new line, and there is now an 'entity_text' field that shows the text that was detected (I covered it in red for obvious reasons). You no longer have to guess which line in the image it was, at what position, etc. This is more readable and helps with the analysis of the anonymization results.

Before: [screenshot of the original log output]

After: [screenshot of the log output with entity_text]

Apr 23 '24 14:04 NuiMrme

One of the reasons we intentionally left out the actual identified text is that it is essentially PII that you might not want to log or return. If you have a suggestion on how to allow this, perhaps not as a default setting, we'd be happy to hear it.

I totally agree that there are cases, especially with the images module, where returning or logging the actual text is needed.

Apr 24 '24 13:04 omri374

Well, they are already printed at the beginning anyway:

[2024-04-24 12:46:05,853][decision_process][INFO][None][nlp artifacts:{"entities": ["Travaux", "Forage D'Eau Du", ...

Apr 25 '24 10:04 NuiMrme

Good catch. I guess that for return_decision_process=True, it makes sense to be more verbose and return the actual values, but for the production version (where return_decision_process is likely disabled), it makes sense to omit it. Would you be interested in proposing a change through a pull request?
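One way this opt-in behaviour could look, as a rough sketch only: the helper describe_result and its include_text parameter are hypothetical and not part of Presidio, and the sample values are made up. The detected text is attached only when the caller explicitly asks for it, so the default output stays free of PII.

```python
def describe_result(result: dict, text: str, include_text: bool = False) -> dict:
    """Return a copy of the result, adding the detected text only on request."""
    described = dict(result)  # copy so the original result is left untouched
    if include_text:
        # Verbose mode: attach the substring the positions refer to.
        described["entity_text"] = text[result["start"]:result["end"]]
    return described


sample_text = "My name is David"
sample_result = {"entity_type": "PERSON", "start": 11, "end": 16, "score": 0.85}

# Default: positions only, no PII in the output.
print(describe_result(sample_result, sample_text))

# Verbose: the detected text is included.
print(describe_result(sample_result, sample_text, include_text=True))
```

In a real change, the flag would presumably be tied to the existing decision-process setting instead of a new parameter, but that wiring is left open here.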

Apr 25 '24 13:04 omri374

Absolutely

Apr 25 '24 13:04 NuiMrme