presidio
Reduce analyzer false positives
Is your feature request related to a problem? Please describe.
The AnalyzerEngine returns False Positive entities (e.g., labeling non-PII text as PII) with fairly high scores (e.g., above 0.7).
Describe the solution you'd like
I believe it may be worthwhile to investigate the scoring algorithm used in the Presidio AnalyzerEngine.
Describe alternatives you've considered
Identify which recognizer produces the False Positives: with access to the analyzer results, you can pass return_decision_process=True to the analyze function, and the results then include the decision process, including the recognizer that identified each entity. Then see how this can better inform our investigation to reduce analyzer False Positives.
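A minimal sketch of that alternative, for illustration only. It assumes each analyzer result exposes analysis_explanation.recognizer once return_decision_process=True is set, as presidio_analyzer's RecognizerResult does; the helper names count_by_recognizer and analyze_with_explanations are hypothetical and not part of Presidio.

```python
from collections import Counter


def count_by_recognizer(results):
    """Tally analyzer results by the recognizer that produced them.

    Assumes each result has `analysis_explanation.recognizer`; results
    without an explanation are counted under "unknown".
    """
    counts = Counter()
    for result in results:
        explanation = getattr(result, "analysis_explanation", None)
        name = getattr(explanation, "recognizer", None) or "unknown"
        counts[name] += 1
    return counts


def analyze_with_explanations(texts, language="en", score_threshold=0.7):
    """Run the analyzer on each text and collect results with explanations."""
    # Imported lazily so the tallying helper above works without Presidio.
    from presidio_analyzer import AnalyzerEngine

    analyzer = AnalyzerEngine()
    results = []
    for text in texts:
        results.extend(
            analyzer.analyze(
                text=text,
                language=language,
                score_threshold=score_threshold,
                return_decision_process=True,
            )
        )
    return results
```

Calling count_by_recognizer(analyze_with_explanations(ocr_texts)).most_common() on known non-PII text would then rank recognizers by how many False Positives each one contributes.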
Also see Issue #998 and the further context below.
Additional context
When running a False Positives investigation with the image redactor (DicomImagePiiVerifyEngine), I ran the redactor on images that had text PII present and on images that did not have any text at all. The redactor uses the Presidio ImageAnalyzerEngine, which uses both a Presidio OCR engine (e.g., TesseractOCR) and the Presidio AnalyzerEngine.
When running DicomImagePiiVerifyEngine.eval_dicom_instance() on all 1693 images in the DICOM de-identification dataset, I saw that 39 images contained False Positive entities. When passing score_threshold=0.7 into the analyzer, we still see 38 images as having False Positives.
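The per-image count above can be sketched as follows. This is an illustrative helper, not the code used in the investigation: images_with_detections is a hypothetical name, each detection is assumed to be a dict with a "score" key, and the toy data is made up. A detection is kept when its score is at or above the threshold.

```python
def images_with_detections(results_by_image, score_threshold=0.0):
    """Return ids of images with at least one detection at or above the threshold."""
    flagged = []
    for image_id, detections in results_by_image.items():
        if any(d["score"] >= score_threshold for d in detections):
            flagged.append(image_id)
    return flagged


# Made-up toy data: image id -> detections reported on a text-free image.
toy_results = {
    "img_a": [{"entity_type": "X", "score": 0.85}],
    "img_b": [{"entity_type": "Y", "score": 0.40}],
    "img_c": [],
}

print(len(images_with_detections(toy_results)))       # 2
print(len(images_with_detections(toy_results, 0.7)))  # 1
```

Comparing the two counts shows how much a threshold alone reduces the number of affected images; when most False Positives score above the threshold, as reported here, the count barely changes.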
The root of the issue is that the OCR returns text for images that contain no text (being addressed in Issue #998). However, ideally, the analyzer would have understood that the text being passed to it was not PII.
This may be difficult to address though because some of the "text" returned by the OCR are numbers, which inherently are likely to be sensitive. Please see the list of example "text" returned by OCR and determined as PII by the analyzer below (score_threshold=0.7):
["3", "deg.", "F", "10,0", "6%", "Pees", "hie", "Soa", "ee", "és", "Dain", "wae", "Pam"]