
Validate OCR output schema

Open niwilso opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe. Several places in the presidio-image-redactor module assume that the OCR results follow a specific format.

While this works as expected with Tesseract OCR, a custom OCR class can cause problems if the output of YourOCRClassHere.perform_ocr() does not match the schema of TesseractOCR.perform_ocr(). In particular, errors could occur in ImageAnalyzerEngine.analyze(), which several other classes call.

Describe the solution you'd like Add a schema check to the OCR base class that ensures the output of .perform_ocr matches the output format of TesseractOCR.perform_ocr.
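
A rough sketch of what such a check could look like, not a proposal for the final implementation. It assumes the expected format is a dictionary of equal-length lists, as pytesseract.image_to_data returns with Output.DICT, and that the keys downstream code needs are text, left, top, width, height and conf. The function name validate_ocr_output and the REQUIRED_KEYS constant are hypothetical and do not exist in presidio-image-redactor.

```python
from typing import Any, Dict, List

# Assumption: these are the keys downstream code reads from the OCR result.
# The list is illustrative and should be confirmed against the real code.
REQUIRED_KEYS = ("text", "left", "top", "width", "height", "conf")


def validate_ocr_output(ocr_result: Any) -> Dict[str, List[Any]]:
    """Raise if ocr_result does not look like a dict of equal-length lists.

    Hypothetical helper; returns the input unchanged when it passes.
    """
    if not isinstance(ocr_result, dict):
        raise TypeError(
            f"OCR output must be a dict, got {type(ocr_result).__name__}"
        )

    missing = [key for key in REQUIRED_KEYS if key not in ocr_result]
    if missing:
        raise ValueError(f"OCR output is missing required keys: {missing}")

    for key in REQUIRED_KEYS:
        if not isinstance(ocr_result[key], (list, tuple)):
            raise ValueError(f"OCR output field '{key}' must be a list")

    # Every field describes the same detected words, so lengths must agree.
    lengths = {key: len(ocr_result[key]) for key in REQUIRED_KEYS}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"OCR output fields differ in length: {lengths}")

    return ocr_result
```

The base class could run a check like this on whatever a subclass's perform_ocr returns, so a mismatched custom OCR class fails early with a clear message instead of somewhere inside ImageAnalyzerEngine.analyze().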

Describe alternatives you've considered Also consider documenting the required output schema of .perform_ocr clearly in the docstrings and documentation.
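
One way the documentation route might look, purely as an illustration: spell the schema out as a typed dictionary and reference it from the base class docstring. The names OCRResult and DocumentedOCR are hypothetical, and the field list is an assumption based on the dictionary pytesseract.image_to_data returns, not a confirmed list of what the module requires.

```python
from abc import ABC, abstractmethod
from typing import List, TypedDict


class OCRResult(TypedDict):
    """Assumed shape of a perform_ocr result: one list entry per detected word."""

    text: List[str]      # recognized text of each word
    left: List[int]      # x coordinate of each bounding box
    top: List[int]       # y coordinate of each bounding box
    width: List[int]     # bounding box widths
    height: List[int]    # bounding box heights
    conf: List[float]    # OCR confidence per word


class DocumentedOCR(ABC):
    """Hypothetical base class whose docstring states the output contract."""

    @abstractmethod
    def perform_ocr(self, image: object, **kwargs) -> OCRResult:
        """Run OCR on an image.

        :param image: The image to process.
        :return: A dictionary matching OCRResult, with all lists the same
            length and index i of every list describing the same word.
        """
```

A type annotation like this is not enforced at runtime, so it complements a schema check rather than replacing it.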

Additional context n/a

niwilso, Jan 19 '23 22:01