presidio
presidio copied to clipboard
Merge two entities from the same type with whitespace between them
Currently, Presidio does not merge adjacent results having the same entity type. This causes the output to have multiple placeholders when replacing.
For example: My name is Dave Jones
could with some NER models produce My name is <PERSON> <PERSON>
.
It is better if the output would be My name is <PERSON>
, especially when replacing tokens with fake values or using OpenAI for fake data generation.
Hey, there. I have a few clarifications.
- I tried this with the
en_core_web_lg
model and the output returned a single<PERSON>
entity for "Dave Jones". Are you saying that this might not be the case with other models? - Also, how would we find if 2 adjacent
<PERSON>
entities belong to the same name? If the sentence has 2 different (adjacent) names (say separated by a comma), then replacing those adjacent<PERSON>
entities with a single<PERSON>
entity might be incorrect, right?
Hi, great clarifications. Yes, with some models this could potentially output two <PERSON> values. See the output of the default text on our demo as an example: https://huggingface.co/spaces/presidio/presidio_demo
Regarding your second point, that's a good point. Perhaps this is more complex than what I initially thought. We could potentially merge only if there's a whitespace, but that would also have some false positives.
Oh okay. So how do we proceed further?
We can either handle this on the analyzer side (just before returning the results, add a component that merges), or on the anonymizer side, as part of the already existing logic for handling overlaps. I tend towards the anonymizer (for example to allow users to create other logic for merging if needed, but still get the raw output from the analyzer). Interested in doing a PR?
Sure, yes. I am very interested. I would need some time to familiarize myself with the codebase, though :sweat_smile:. Is that okay?
Absolutely 🙂
I was wondering if there were any updates / movement on this, or identified areas in the code where this needs to be changed. I notice that at least swapping in the "deid_roberta_i2d2" model, contiguous entities of the same type are not elided.
@argideritzalpea this is in the main branch but not yet release to PyPI. We plan to release in the next few weeks. In the mean time, you can install from source.
If we want to ensure that the bounding boxes are consolidated as well in presidio-image-redactor, where would a change need to apply? I tried implementing @gokullan ’s change to PresidioAnalyzer instead of the anonymizer, but the bounding boxes in the image using presidio-image-redactor after redaction do not change to reflect an elision of entities.
Cc @niwilso who might have an idea here.
@argideritzalpea - If multiple words are identified as one entity to redact from the text analyzer, then I believe that should also be automatically carried forward in the image redactor.
If not, then I would suggest adding another method to the ImageAnalyzerEngine
. See the analyze
method in https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/presidio_image_redactor/image_analyzer_engine.py and notice how the text detected via OCR is fed into the standard (non-image) analyzer engine.
If we need to specify to merge entities together from the bounding box side, we can write another method to check against the OCR results, analyzer results, and bounding boxes and then merge certain boxes.
I think it would make more sense that the entity merging logic affects the results in the AnalyzerEngine instead of the AnonymizerEngine as it is currently in the code. Then the ImageAnalyzerEngine would have access to the merged entities list regardless of anonymization.
Finally, the logic in the ImageAnalyzerEngine.analyze
method creates bounding boxes per word, not entity. It would be good to have an option to create bounding boxes according to the entity if you so choose, or as a default even. It can be much safer to redact a slightly larger contiguous area so that individual words are not identified. Sizing of bounding boxes would indicate word length, which reduces security of the redaction.