
image to data

Open aditya11ad opened this issue 1 year ago • 2 comments

Hi,

Is there any way to pass an invoice image? I mean, can we apply Tesseract OCR to the image first and then run invoice2data on the result?

Thanks in advance.

aditya11ad, Sep 12 '22 07:09

Ran into the same problem. Some invoices have crucial data embedded in images.

This library offers either "text extraction" or "ocr".

What you are looking for is probably an hOCR function, which applies OCR to the images and extracts the text from the PDF.

This library supports pdfminer. Recently pdfminer got hOCR support.

pdfminer/pdfminer.six#651

I haven't had time yet to try that functionality or to see how it can be used in this library. Maybe we need to update the documentation to explain how to use it.

bosd, Sep 14 '22 15:09

Sorry, my previous statement was wrong. The best method for now is to pass the whole invoice to the tesseract input method.

bosd, Sep 25 '22 14:09
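
For readers who want to try the tesseract input method mentioned above, here is a minimal sketch that drives the invoice2data command line from Python. It assumes the installed invoice2data version exposes the `--input-reader tesseract` option and that Tesseract itself is installed; whether plain image files (rather than PDFs) are accepted depends on the version. The helper names `build_invoice2data_command` and `run_invoice2data` are hypothetical and not part of the library.

```python
import shutil
import subprocess


def build_invoice2data_command(invoice_path, input_reader="tesseract"):
    """Return the argv list for one invoice2data run.

    Assumption: the installed CLI accepts `--input-reader <name>`.
    """
    return ["invoice2data", "--input-reader", input_reader, invoice_path]


def run_invoice2data(invoice_path, input_reader="tesseract"):
    """Run the CLI and return the finished process (stdout/stderr captured)."""
    # Fail early with a clear message if the CLI is not installed.
    if shutil.which("invoice2data") is None:
        raise RuntimeError("invoice2data CLI not found on PATH")
    cmd = build_invoice2data_command(invoice_path, input_reader)
    return subprocess.run(cmd, capture_output=True, text=True, check=True)


# Example (requires invoice2data and Tesseract to be installed):
# result = run_invoice2data("invoice.pdf")
# print(result.stdout, result.stderr)
```

Building the command in a separate function keeps the part that needs external tools isolated, so the argument list can be inspected or logged before anything is executed.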

@aditya11ad I've made a PR in which ocrmypdf is used as a pre-processor for invoice2data. Can you review it and check whether it covers your use case?

bosd, Feb 26 '23 11:02
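
The pre-processing idea can also be approximated by hand as two separate steps: first let ocrmypdf add a text layer, then run invoice2data on the result. The sketch below is only an illustration of that idea, not the implementation from the PR. It assumes both the `ocrmypdf` and `invoice2data` command-line tools are installed; the helper names and the `.ocr.pdf` naming scheme are hypothetical. Depending on the input, ocrmypdf may need extra options such as `--skip-text` or `--force-ocr` for PDFs that already contain text.

```python
import subprocess
from pathlib import Path


def ocr_output_path(invoice_pdf):
    """Hypothetical naming scheme: invoice.pdf -> invoice.ocr.pdf (same folder)."""
    source = Path(invoice_pdf)
    return source.with_name(source.stem + ".ocr.pdf")


def build_pipeline(invoice_pdf):
    """Return the two commands: OCR pre-processing, then data extraction."""
    ocr_pdf = ocr_output_path(invoice_pdf)
    return [
        # Step 1: ocrmypdf writes a copy of the PDF with an OCR text layer.
        ["ocrmypdf", str(invoice_pdf), str(ocr_pdf)],
        # Step 2: invoice2data reads the OCR'd copy with its default reader.
        ["invoice2data", str(ocr_pdf)],
    ]


def run_pipeline(invoice_pdf):
    """Run both steps in order; stop at the first failing command."""
    for cmd in build_pipeline(invoice_pdf):
        subprocess.run(cmd, check=True)


# Example (requires ocrmypdf and invoice2data to be installed):
# run_pipeline("invoice.pdf")
```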

Implemented in #409

bosd, Jun 19 '23 12:06