invoice2data
image to data
Hi,
Is there any way to pass an invoice image as input? I mean, can we apply Tesseract OCR to the image first and then run invoice2data on the result?
Thanks in advance.
Ran into the same problem. Some invoices have crucial data embedded in images.
This library offers either "text extraction" or "OCR".
What you are looking for is probably an hOCR function, which applies OCR to the images in the PDF and extracts their text.
This library supports pdfminer, and pdfminer recently got hOCR support:
pdfminer/pdfminer.six#651
I haven't had time yet to play with that functionality or to work out how it can be used in this library. Maybe we need to update the documentation on how to use it.
Sorry, my previous statement was wrong. The best method for now is to pass the whole invoice to the tesseract input method.
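For anyone landing here, a minimal sketch of what passing the invoice to the tesseract input method could look like from Python. It assumes invoice2data and its OCR dependencies (Tesseract, plus ImageMagick for PDFs) are installed. The function name `extract_with_tesseract` and the file name in the usage note are made up for illustration, and whether a bare image file is accepted may depend on the invoice2data version.

```python
def extract_with_tesseract(invoice_path):
    """Extract invoice fields using OCR instead of the embedded text layer."""
    # Imported lazily so this file can be loaded without invoice2data installed.
    from invoice2data import extract_data
    from invoice2data.extract.loader import read_templates
    from invoice2data.input import tesseract

    # With no folder given, read_templates() loads the templates bundled
    # with invoice2data; pass a folder path to use your own.
    templates = read_templates()

    # input_module selects how the file is turned into text before
    # template matching; here the whole invoice goes through OCR.
    return extract_data(str(invoice_path), templates=templates, input_module=tesseract)
```

Usage would be something like `result = extract_with_tesseract("scanned_invoice.pdf")`. The result is only useful if one of the loaded templates matches the OCR'd text.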
@aditya11ad I've made a PR in which ocrmypdf is used as a pre-processor for invoice2data. Can you review it and check whether it covers your use case?
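To illustrate the idea (this is not the code from the PR, just a rough sketch of the same approach done by hand): run the `ocrmypdf` command-line tool first so the PDF gets a text layer, then feed the result to invoice2data with its default text extraction. The helper names `build_ocrmypdf_command` and `ocr_then_extract` and the `.ocr.pdf` suffix are hypothetical. It assumes `ocrmypdf` is on the PATH.

```python
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, language="eng"):
    """Build the ocrmypdf call that adds an OCR text layer to src."""
    # --skip-text leaves pages that already contain text untouched,
    # so only image-only pages are OCR'd. -l sets the OCR language.
    return ["ocrmypdf", "--skip-text", "-l", language, str(src), str(dst)]


def ocr_then_extract(src, workdir, language="eng"):
    """OCR the invoice with ocrmypdf, then run invoice2data on the output."""
    # Imported lazily so this file can be loaded without invoice2data installed.
    from invoice2data import extract_data

    dst = Path(workdir) / (Path(src).stem + ".ocr.pdf")
    # check=True raises CalledProcessError if ocrmypdf fails.
    subprocess.run(build_ocrmypdf_command(src, dst, language), check=True)

    # With no templates argument, extract_data uses the bundled templates.
    return extract_data(str(dst))
```

Keeping the OCR step separate from extraction means the OCR'd PDF can be inspected or cached before invoice2data sees it. For plain image input, ocrmypdf may additionally need `--image-dpi` when the file carries no resolution metadata.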
Implemented in #409