presidio icon indicating copy to clipboard operation
presidio copied to clipboard

Merge two entities from the same type with whitespace between them

Open omri374 opened this issue 1 year ago • 12 comments

Currently, Presidio does not merge adjacent results having the same entity type. This causes the output to have multiple placeholders when replacing.

For example: My name is Dave Jones could with some NER models produce My name is <PERSON> <PERSON>. It is better if the output would be My name is <PERSON>, especially when replacing tokens with fake values or using OpenAI for fake data generation.

omri374 avatar Jun 13 '23 11:06 omri374

Hey, there. I have a few clarifications.

  • I tried this with the en_core_web_lg model and the output returned a single <PERSON> entity for "Dave Jones". Are you saying that this might not be the case with other models?
  • Also, how would we find if 2 adjacent <PERSON> entities belong to the same name? If the sentence has 2 different (adjacent) names (say separated by a comma), then replacing those adjacent <PERSON> entities with a single <PERSON> entity might be incorrect, right?

gokullan avatar Jun 13 '23 13:06 gokullan

Hi, great clarifications. Yes, with some models this could potentially output two <PERSON> values. See the output of the default text on our demo as an example: https://huggingface.co/spaces/presidio/presidio_demo

Regarding your second point, that's a good point. Perhaps this is more complex than what I initially thought. We could potentially merge only if there's a whitespace, but that would also have some false positives.

omri374 avatar Jun 13 '23 15:06 omri374

Oh okay. So how do we proceed further?

gokullan avatar Jun 14 '23 06:06 gokullan

We can either handle this on the analyzer side (just before returning the results, add a component that merges), or on the anonymizer side, as part of the already existing logic for handling overlaps. I tend towards the anonymizer (for example to allow users to create other logic for merging if needed, but still get the raw output from the analyzer). Interested in doing a PR?

omri374 avatar Jun 14 '23 07:06 omri374

Sure, yes. I am very interested. I would need some time to familiarize myself with the codebase, though :sweat_smile:. Is that okay?

gokullan avatar Jun 14 '23 14:06 gokullan

Absolutely 🙂

omri374 avatar Jun 14 '23 16:06 omri374

I was wondering if there were any updates / movement on this, or identified areas in the code where this needs to be changed. I notice that at least swapping in the "deid_roberta_i2d2" model, contiguous entities of the same type are not elided.

argideritzalpea avatar Oct 10 '23 18:10 argideritzalpea

@argideritzalpea this is in the main branch but not yet release to PyPI. We plan to release in the next few weeks. In the mean time, you can install from source.

omri374 avatar Oct 11 '23 09:10 omri374

If we want to ensure that the bounding boxes are consolidated as well in presidio-image-redactor, where would a change need to apply? I tried implementing @gokullan ’s change to PresidioAnalyzer instead of the anonymizer, but the bounding boxes in the image using presidio-image-redactor after redaction do not change to reflect an elision of entities.

argideritzalpea avatar Oct 18 '23 17:10 argideritzalpea

Cc @niwilso who might have an idea here.

omri374 avatar Oct 18 '23 18:10 omri374

@argideritzalpea - If multiple words are identified as one entity to redact from the text analyzer, then I believe that should also be automatically carried forward in the image redactor.

If not, then I would suggest adding another method to the ImageAnalyzerEngine. See the analyze method in https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/presidio_image_redactor/image_analyzer_engine.py and notice how the text detected via OCR is fed into the standard (non-image) analyzer engine.

If we need to specify to merge entities together from the bounding box side, we can write another method to check against the OCR results, analyzer results, and bounding boxes and then merge certain boxes.

niwilso avatar Oct 18 '23 19:10 niwilso

I think it would make more sense that the entity merging logic affects the results in the AnalyzerEngine instead of the AnonymizerEngine as it is currently in the code. Then the ImageAnalyzerEngine would have access to the merged entities list regardless of anonymization. Finally, the logic in the ImageAnalyzerEngine.analyze method creates bounding boxes per word, not entity. It would be good to have an option to create bounding boxes according to the entity if you so choose, or as a default even. It can be much safer to redact a slightly larger contiguous area so that individual words are not identified. Sizing of bounding boxes would indicate word length, which reduces security of the redaction.

argideritzalpea avatar Oct 18 '23 21:10 argideritzalpea