invoice2data image to data

image to data

Open aditya11ad opened this issue 1 year ago • 2 comments

Hi,

Is there any way to pass an invoice image? I mean, can we apply Tesseract OCR to the image first and then run invoice2data on the result?

Thanks in advance.

Sep 12 '22 07:09 aditya11ad

Ran into the same problem. Some invoices have crucial data embedded in images.

This library has either "text extraction" or "OCR".

What you are looking for is probably an hOCR function, which applies OCR to the images in the PDF and extracts their text.

This library supports pdfminer. Recently pdfminer got hocr support.

pdfminer/pdfminer.six#651

I haven't had time yet to try that functionality or to see how it can be used in this library. Maybe we need to update the documentation to explain how to use it.

Sep 14 '22 15:09 bosd

Sorry, my previous statement was wrong. The best method for now is to pass the whole invoice to the tesseract input method.
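
For reference, a minimal sketch of selecting the tesseract input method from the command line, assuming the `--input-reader` option described in the invoice2data README. The helper name `build_invoice2data_cmd` and the file name `invoice.pdf` are made up for illustration, and the snippet only builds and prints the command rather than running it:

```python
from typing import List


def build_invoice2data_cmd(invoice_path: str, input_reader: str = "tesseract") -> List[str]:
    """Build an invoice2data command line that selects an input reader.

    Hypothetical helper: it only assembles the argument list, so the
    whole document is read through OCR instead of plain text extraction.
    """
    return ["invoice2data", "--input-reader", input_reader, invoice_path]


# "invoice.pdf" is a placeholder path, not a file from this thread.
cmd = build_invoice2data_cmd("invoice.pdf")
print(" ".join(cmd))
```

To actually run it, the list can be handed to `subprocess.run(cmd, check=True)`, provided invoice2data and Tesseract are installed.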

Sep 25 '22 14:09 bosd

@aditya11ad I've made a PR in which ocrmypdf is used as a pre-processor for invoice2data. Can you review it and check whether it covers your use case?
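
Until that is merged, the same idea can be approximated by hand as two separate steps. This is a rough sketch and not necessarily how the PR wires it in: it assumes the `ocrmypdf` and `invoice2data` command-line tools are installed, uses the `--image-dpi` option of ocrmypdf for image inputs, and the helper name `build_preprocess_cmds` and the file names are hypothetical:

```python
from typing import List, Optional


def build_preprocess_cmds(
    source: str, ocr_pdf: str, image_dpi: Optional[int] = None
) -> List[List[str]]:
    """Return two commands: OCR the source into a searchable PDF, then extract.

    Hypothetical helper for illustration; it only builds argument lists.
    """
    ocr_cmd = ["ocrmypdf"]
    if image_dpi is not None:
        # ocrmypdf needs a DPI hint when the input is an image without one.
        ocr_cmd += ["--image-dpi", str(image_dpi)]
    ocr_cmd += [source, ocr_pdf]
    # The OCRed PDF now has a text layer, so the default reader can be used.
    extract_cmd = ["invoice2data", ocr_pdf]
    return [ocr_cmd, extract_cmd]


# Placeholder file names; print the commands instead of executing them.
for step in build_preprocess_cmds("invoice.png", "invoice_ocr.pdf", image_dpi=300):
    print(" ".join(step))
```

Each list can be executed in order with `subprocess.run(step, check=True)`. The advantage of this split is that text already present in the PDF is kept, while text inside images is added as an OCR layer.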

Feb 26 '23 11:02 bosd

Implemented in #409

Jun 19 '23 12:06 bosd