presidio
Adding QR code support to the ImageRedactorEngine
Change Description
This PR adds to the Presidio Image Redactor the ability to analyze the content of QR codes in an image.
Summary of Changes
- Added abstract class `QRRecognizer` for QR code recognizers
- Added concrete `OpenCVQRRecognizer`, which uses OpenCV to recognize QR codes
- Added `QRImageAnalyzerEngine`, which uses `QRRecognizer` for QR code recognition and `AnalyzerEngine` to analyze its contents for PII entities
- Modified `ImagePiiVerifyEngine` and `ImageRedactorEngine` to allow using `QRImageAnalyzerEngine` as an alternative to `ImageAnalyzerEngine` (a usage sketch follows below)
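For orientation, here is a minimal usage sketch, assuming `QRImageAnalyzerEngine` is exported from `presidio_image_redactor` and that `ImageRedactorEngine` accepts it via its `image_analyzer_engine` parameter; the exact import path and defaults may differ from the final code.

```python
from PIL import Image
from presidio_image_redactor import ImageRedactorEngine
# QRImageAnalyzerEngine is added by this PR; the import path is an assumption
from presidio_image_redactor import QRImageAnalyzerEngine

# Use the QR-code-aware analyzer instead of the default OCR-based ImageAnalyzerEngine
engine = ImageRedactorEngine(image_analyzer_engine=QRImageAnalyzerEngine())

image = Image.open("ticket_with_qr_code.png")
redacted = engine.redact(image, fill=(0, 0, 0))
redacted.save("ticket_redacted.png")
```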
Issue reference
This PR fixes issue #1035
Checklist
- [x] I have reviewed the contribution guidelines
- [x] I have signed the CLA (if required)
- [x] My code includes unit tests
- [x] All unit tests and lint checks pass locally
- [x] My PR contains documentation updates / additions if required
@microsoft-github-policy-service agree
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@vpvpvpvp Seems like the unit tests are failing in the CI, what version of Tesseract OCR did you use to generate the baseline images for the tests?
- Tesseract OCR: 5.2.0
- pytesseract: 0.3.10
- OS: macOS Ventura 13.2
Indeed, I noticed that the output of `ImagePiiVerifyEngine` can differ by a few pixels between test environments. For example, the first image below is the result on macOS, the second is the result on Ubuntu, and the third is their difference. The recognized text itself and the box coordinates are the same in both cases.
Hi @vpvpvpvp, before going deeper into the code, what are your thoughts on having the QR code analyzer potentially working in parallel with OCR? Something like this:
```mermaid
stateDiagram-v2
    read_image
    read_image --> extract_ocr_text
    read_image --> extract_qr_text
    extract_ocr_text --> presidio_analyzer
    extract_qr_text --> presidio_analyzer
    presidio_analyzer --> redact_image
    redact_image --> return_image
```
Then we could always extend it to more types of detectors in the future, similar to the text analyzer architecture, e.g.:
```mermaid
stateDiagram-v2
    read_image
    read_image --> extract_ocr_text
    read_image --> extract_qr_text
    read_image --> extract_faces
    read_image --> extract_license_plates
    extract_ocr_text --> presidio_analyzer
    extract_qr_text --> presidio_analyzer
    presidio_analyzer --> redact_image
    extract_faces --> redact_image
    extract_license_plates --> redact_image
    redact_image --> return_image
```
One way to achieve this is to have `QRImageAnalyzerEngine` extend `ImageAnalyzerEngine`, and then we could later create a composable `ImageAnalyzerEngine` which holds multiple image analyzers. WDYT?
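For illustration only, a rough sketch of such a composable engine; the class name and the idea of simply concatenating results are hypothetical, not something this PR implements:

```python
from typing import List
from presidio_image_redactor import ImageAnalyzerEngine
from presidio_image_redactor.entities import ImageRecognizerResult


class CompositeImageAnalyzerEngine:
    """Hypothetical composite engine: fans an image out to several analyzers
    (OCR-based, QR-based, later faces/license plates/...) and merges their results."""

    def __init__(self, analyzers: List[ImageAnalyzerEngine]):
        self.analyzers = analyzers

    def analyze(self, image, **kwargs) -> List[ImageRecognizerResult]:
        results: List[ImageRecognizerResult] = []
        for analyzer in self.analyzers:
            # Each analyzer extracts its own kind of text and runs presidio-analyzer on it
            results.extend(analyzer.analyze(image, **kwargs))
        return results
```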
Hi @omri374, that sounds great! In the current PR you can choose between `QRImageAnalyzerEngine` and `ImageAnalyzerEngine`, but it would be great to be able to run them and other analyzers in parallel. At first, I wanted to extend `ImageAnalyzerEngine` a bit so that it would also accept a `QRRecognizer` as a parameter in addition to OCR. Something like this:
```python
from typing import Optional
from presidio_analyzer import AnalyzerEngine
from presidio_image_redactor import OCR, TesseractOCR
# QRRecognizer / OpenCVQRRecognizer are the classes added in this PR


class ImageAnalyzerEngine:
    """ImageAnalyzerEngine class.

    :param analyzer_engine: The Presidio AnalyzerEngine instance
        to be used to detect PII in text
    :param ocr: the OCR object to be used to detect text in images
    :param qr: the QRRecognizer object to detect and decode text in QR codes
    """

    def __init__(
        self,
        analyzer_engine: Optional[AnalyzerEngine] = None,
        ocr: Optional[OCR] = None,
        qr: Optional[QRRecognizer] = None,
    ):
        # Fall back to defaults when nothing is supplied (filled in for completeness)
        self.analyzer_engine = analyzer_engine or AnalyzerEngine()
        self.ocr = ocr or TesseractOCR()
        self.qr = qr or OpenCVQRRecognizer()
```
And then, in the `analyze` function, extract the text and its coordinates first with `self.ocr` and then with `self.qr`. But in the end I decided not to change `ImageAnalyzerEngine` this time, to make fewer edits to the original source code.
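For concreteness, a small sketch of the flow described above (the helper name and the dict shape are illustrative, not code from this PR): every piece of text, whether it comes from OCR or from a decoded QR code, is sent to the same Presidio `AnalyzerEngine` together with the bounding box it was extracted from.

```python
from typing import List, Tuple
from presidio_analyzer import AnalyzerEngine


def analyze_image_texts(
    texts_with_boxes: List[Tuple[str, Tuple[int, int, int, int]]],
    analyzer: AnalyzerEngine,
) -> List[dict]:
    """Run PII analysis on (text, (left, top, width, height)) pairs
    produced by OCR and/or QR decoding."""
    findings = []
    for text, (left, top, width, height) in texts_with_boxes:
        for result in analyzer.analyze(text=text, language="en"):
            findings.append(
                {
                    "entity_type": result.entity_type,
                    "score": result.score,
                    # redact the whole region the text was extracted from
                    "box": (left, top, width, height),
                }
            )
    return findings
```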
I would suggest using the original image as the baseline (not a screenshot of it or of the screen). If it's still failing, let's see how to add thresholding to the comparison.
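A minimal sketch of what a thresholded comparison might look like in the tests; the function and the tolerance value are illustrative, not part of this PR:

```python
import numpy as np
from PIL import Image, ImageChops


def images_match(result: Image.Image, baseline: Image.Image, max_mean_diff: float = 1.0) -> bool:
    """Compare two images, tolerating small per-pixel rendering differences."""
    diff = ImageChops.difference(result.convert("RGB"), baseline.convert("RGB"))
    # Mean absolute per-channel difference; 0.0 means the images are identical
    return float(np.mean(np.asarray(diff))) <= max_mean_diff
```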
Updated the test images, locally the tests passed.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@vpvpvpvp Seems to work now, but it's failing on another test that should be resolved in https://github.com/microsoft/presidio/pull/1032; try rebasing once that's merged.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@vpvpvpvp you have a green build 🎊 Will try to review the code later today