presidio
Adding QR code support to the ImageRedactorEngine
Change Description
This PR adds to the Presidio Image Redactor the ability to analyze the content of QR codes in an image.
Summary of Changes
- Added abstract class `QRRecognizer` for QR code recognizers
- Added concrete `OpenCVQRRecognizer`, which uses OpenCV to recognize QR codes
- Added `QRImageAnalyzerEngine`, which uses `QRRecognizer` for QR code recognition and `AnalyzerEngine` to analyze its contents for PII entities
- Modified `ImagePiiVerifyEngine` and `ImageRedactorEngine` to allow using `QRImageAnalyzerEngine` as an alternative to `ImageAnalyzerEngine` (a usage sketch follows below)
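For orientation, here is a minimal usage sketch, assuming `QRImageAnalyzerEngine` is exported from `presidio_image_redactor` and that `ImageRedactorEngine` accepts it via its `image_analyzer_engine` parameter; the exact import path and defaults may differ from the final code.

```python
from PIL import Image
from presidio_image_redactor import ImageRedactorEngine
# QRImageAnalyzerEngine is added by this PR; the import path is an assumption
from presidio_image_redactor import QRImageAnalyzerEngine

# Use the QR-code-aware analyzer instead of the default OCR-based ImageAnalyzerEngine
engine = ImageRedactorEngine(image_analyzer_engine=QRImageAnalyzerEngine())

image = Image.open("ticket_with_qr_code.png")
redacted = engine.redact(image, fill=(0, 0, 0))
redacted.save("ticket_redacted.png")
```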
Issue reference
This PR fixes issue #1035
Checklist
- [x] I have reviewed the contribution guidelines
- [x] I have signed the CLA (if required)
- [x] My code includes unit tests
- [x] All unit tests and lint checks pass locally
- [x] My PR contains documentation updates / additions if required
@microsoft-github-policy-service agree
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@vpvpvpvp Seems like the unit tests are failing in the CI, what version of Tesseract OCR did you use to generate the baseline images for the tests?
- Tesseract OCR: 5.2.0
- pytesseract: 0.3.10
- OS: macOS Ventura 13.2
Indeed, I noticed that the output of `ImagePiiVerifyEngine` can differ by a few pixels between test environments. For example, the first image below is the result on macOS, the second is the result on Ubuntu, and the third is their difference. The recognized text itself and the box coordinates are the same in both cases.
Hi @vpvpvpvp, before going deeper into the code, what are your thoughts on having the QR code analyzer potentially working in parallel with OCR? Something like this:
```mermaid
stateDiagram-v2
    read_image
    read_image --> extract_ocr_text
    read_image --> extract_qr_text
    extract_ocr_text --> presidio_analyzer
    extract_qr_text --> presidio_analyzer
    presidio_analyzer --> redact_image
    redact_image --> return_image
```
Then we could always extend it to more types of detectors in the future, similar to the text analyzer architecture, e.g.:
```mermaid
stateDiagram-v2
    read_image
    read_image --> extract_ocr_text
    read_image --> extract_qr_text
    read_image --> extract_faces
    read_image --> extract_license_plates
    extract_ocr_text --> presidio_analyzer
    extract_qr_text --> presidio_analyzer
    presidio_analyzer --> redact_image
    extract_faces --> redact_image
    extract_license_plates --> redact_image
    redact_image --> return_image
```
One way to achieve this is to have `QRImageAnalyzerEngine` extend `ImageAnalyzerEngine`, and then we could later create a composable `ImageAnalyzerEngine` which holds multiple image analyzers. WDYT?
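For illustration only, a rough sketch of such a composable engine; the class name and the idea of simply concatenating results are hypothetical, not something this PR implements:

```python
from typing import List
from presidio_image_redactor import ImageAnalyzerEngine
from presidio_image_redactor.entities import ImageRecognizerResult


class CompositeImageAnalyzerEngine:
    """Hypothetical composite engine: fans an image out to several analyzers
    (OCR-based, QR-based, later faces/license plates/...) and merges their results."""

    def __init__(self, analyzers: List[ImageAnalyzerEngine]):
        self.analyzers = analyzers

    def analyze(self, image, **kwargs) -> List[ImageRecognizerResult]:
        results: List[ImageRecognizerResult] = []
        for analyzer in self.analyzers:
            # Each analyzer extracts its own kind of text and runs presidio-analyzer on it
            results.extend(analyzer.analyze(image, **kwargs))
        return results
```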
Hi @omri374, that sounds great! In the current PR you can choose between `QRImageAnalyzerEngine` and `ImageAnalyzerEngine`, but it would be great to be able to run them and other analyzers in parallel. At first, I wanted to extend `ImageAnalyzerEngine` a bit so that it would also accept a `QRRecognizer` as a parameter in addition to OCR. Something like this:
```python
from typing import Optional
from presidio_analyzer import AnalyzerEngine
from presidio_image_redactor import OCR, TesseractOCR
# QRRecognizer / OpenCVQRRecognizer are the classes added in this PR


class ImageAnalyzerEngine:
    """ImageAnalyzerEngine class.

    :param analyzer_engine: The Presidio AnalyzerEngine instance
        to be used to detect PII in text
    :param ocr: the OCR object to be used to detect text in images
    :param qr: the QRRecognizer object to detect and decode text in QR codes
    """

    def __init__(
        self,
        analyzer_engine: Optional[AnalyzerEngine] = None,
        ocr: Optional[OCR] = None,
        qr: Optional[QRRecognizer] = None,
    ):
        # Fall back to defaults when nothing is supplied (filled in for completeness)
        self.analyzer_engine = analyzer_engine or AnalyzerEngine()
        self.ocr = ocr or TesseractOCR()
        self.qr = qr or OpenCVQRRecognizer()
```
And then, in the `analyze` function, extract the text and its coordinates first with `self.ocr` and then with `self.qr`. But in the end I decided not to change `ImageAnalyzerEngine` this time, to make fewer edits to the original source code.
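For concreteness, a small sketch of the flow described above (the helper name and the dict shape are illustrative, not code from this PR): every piece of text, whether it comes from OCR or from a decoded QR code, is sent to the same Presidio `AnalyzerEngine` together with the bounding box it was extracted from.

```python
from typing import List, Tuple
from presidio_analyzer import AnalyzerEngine


def analyze_image_texts(
    texts_with_boxes: List[Tuple[str, Tuple[int, int, int, int]]],
    analyzer: AnalyzerEngine,
) -> List[dict]:
    """Run PII analysis on (text, (left, top, width, height)) pairs
    produced by OCR and/or QR decoding."""
    findings = []
    for text, (left, top, width, height) in texts_with_boxes:
        for result in analyzer.analyze(text=text, language="en"):
            findings.append(
                {
                    "entity_type": result.entity_type,
                    "score": result.score,
                    # redact the whole region the text was extracted from
                    "box": (left, top, width, height),
                }
            )
    return findings
```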
I would suggest using the original image as the baseline (not a screenshot of it or of the screen). If it's still failing, let's see how to add thresholding to the comparison.
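A minimal sketch of what a thresholded comparison might look like in the tests; the function and the tolerance value are illustrative, not part of this PR:

```python
import numpy as np
from PIL import Image, ImageChops


def images_match(result: Image.Image, baseline: Image.Image, max_mean_diff: float = 1.0) -> bool:
    """Compare two images, tolerating small per-pixel rendering differences."""
    diff = ImageChops.difference(result.convert("RGB"), baseline.convert("RGB"))
    # Mean absolute per-channel difference; 0.0 means the images are identical
    return float(np.mean(np.asarray(diff))) <= max_mean_diff
```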
Updated the test images, locally the tests passed.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@vpvpvpvp Seems to work now, but it's failing on another test that should be resolved in https://github.com/microsoft/presidio/pull/1032; try rebasing once that's merged.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@vpvpvpvp you have a green build 🎊 Will try to review the code later today