presidio
Validate OCR output schema
Is your feature request related to a problem? Please describe.
Several places in the presidio-image-redactor module rely on the OCR results having a specific format. This works as expected with Tesseract OCR, but a custom OCR class can cause problems if the output of YourOCRClassHere.perform_ocr() does not match the schema of TesseractOCR.perform_ocr(). In particular, errors could occur in ImageAnalyzerEngine.analyze(), which is called in a few other classes.
Describe the solution you'd like
Add a schema check inside the OCR base class that ensures the output of .perform_ocr matches the output format of TesseractOCR.perform_ocr.
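A minimal sketch of what such a check might look like. The function name validate_ocr_output and the REQUIRED_KEYS tuple are hypothetical, not part of Presidio. The key list assumes the Tesseract output is the dict form of pytesseract's image_to_data and that downstream code reads only the word-level fields below; the real required set should be confirmed against the code before adopting anything like this.

```python
from typing import Any, Dict, Sequence

# ASSUMPTION: these are the word-level fields downstream code reads.
# They mirror keys in pytesseract's dict output, but the exact set
# required by the image redactor is not confirmed here.
REQUIRED_KEYS = ("text", "left", "top", "width", "height", "conf")


def validate_ocr_output(ocr_result: Any) -> Dict[str, Sequence]:
    """Hypothetical helper: raise ValueError if the OCR output does not
    match the assumed schema, otherwise return it unchanged."""
    if not isinstance(ocr_result, dict):
        raise ValueError(
            f"OCR output must be a dict, got {type(ocr_result).__name__}"
        )

    missing = [key for key in REQUIRED_KEYS if key not in ocr_result]
    if missing:
        raise ValueError(f"OCR output is missing required keys: {missing}")

    # Every required field should be a sequence with one entry per word.
    lengths = {}
    for key in REQUIRED_KEYS:
        value = ocr_result[key]
        if not isinstance(value, (list, tuple)):
            raise ValueError(f"OCR output field '{key}' must be a list or tuple")
        lengths[key] = len(value)

    if len(set(lengths.values())) > 1:
        raise ValueError(f"OCR output fields differ in length: {lengths}")

    return ocr_result
```

The base class could call a helper like this on the result of .perform_ocr before handing it to the analyzer, so a custom OCR class fails early with a clear message instead of deeper inside ImageAnalyzerEngine.analyze().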
Describe alternatives you've considered
Consider also making the required output schema of .perform_ocr clear in the docstrings and documentation.
Additional context
n/a