invoice2data

autodetect pdf type

Open gregoribic opened this issue 3 years ago • 4 comments

Is there already a way to detect whether a PDF is searchable (pdftotext) or an image (OCR, tesseract) and then use the appropriate method for text extraction?

gregoribic avatar May 10 '21 10:05 gregoribic

This can be done with a simple if-else check: first try to extract the PDF with pdftotext, and if the result is False, try again with OCR (tesseract).

result = extract_data(filename, templates=templates)
if not result:
    result = extract_data(filename, templates=templates, input_module=tesseract)

nayyhah avatar Jan 23 '22 12:01 nayyhah

Hello, I'm trying to apply what you are saying, but I'm getting the following error: "NameError: name 'tesseract' is not defined"

The same happens when I set input_module to "pdftotext" or any of the other modules. invoice2data works well for me with normal PDFs, but in this case I'm trying to process a scanned PDF, which is why I need to specify tesseract as the input_module.

Hope you can help me.

manuel-barreiro avatar Apr 08 '22 18:04 manuel-barreiro
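
The NameError reported above is what Python raises when `tesseract` is used as a bare name that was never imported; the earlier snippet passes it as an object, not as a string. Below is a minimal sketch of the fallback, assuming the input modules can be imported from `invoice2data.input` and that `extract_data` returns a falsy value on failure (both worth checking against the installed version). `first_successful` and `invoice2data_extractors` are hypothetical helper names, not part of invoice2data.

```python
from typing import Any, Callable, Iterable, Optional, Tuple


def first_successful(
    filename: str,
    extractors: Iterable[Tuple[str, Callable[[str], Any]]],
) -> Tuple[Optional[str], Any]:
    """Try each (label, extractor) in order and return the first truthy result."""
    for label, extract in extractors:
        result = extract(filename)
        if result:  # a falsy result is treated as "this method failed"
            return label, result
    return None, None


def invoice2data_extractors(templates):
    """Build the extractor list for invoice2data (import paths are an assumption)."""
    # Imported lazily so first_successful stays usable without invoice2data.
    from invoice2data import extract_data
    from invoice2data.input import pdftotext, tesseract  # module objects

    return [
        ("pdftotext",
         lambda f: extract_data(f, templates=templates, input_module=pdftotext)),
        ("tesseract",
         lambda f: extract_data(f, templates=templates, input_module=tesseract)),
    ]
```

With templates already loaded, this could be called as `label, result = first_successful("invoice.pdf", invoice2data_extractors(templates))`; `label` then tells you which method produced the result.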

Could be, but how would that handle corner cases? I've got a couple of invoices where the company info is placed in a header image.

The invoice line part is the same.
Branch A --> shows a header image with Branch A business info
Branch B --> shows a header image with Branch B business info

(Or another company that issues invoices with its company info as a flat image and the rest of the invoice as text.)

bosd avatar Aug 26 '22 07:08 bosd

Previously there was a function in invoice2data that checked the PDF output. It was something like: if the output is less than 80 characters, fall back on Tesseract to OCR the PDF. It was removed, possibly because of stability issues?
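
For illustration, the kind of length check described above could look roughly like the sketch below. This is not the removed invoice2data function; `looks_scanned` and `text_with_ocr_fallback` are hypothetical names, the two extractor callables are placeholders for whatever text and OCR backends are in use, and counting only non-whitespace characters is an assumption made here.

```python
from typing import Callable


def looks_scanned(text: str, min_chars: int = 80) -> bool:
    """Return True when the extracted text is too short to be a real text layer."""
    # Count only visible characters so pages consisting of newlines or
    # form feeds do not pass the check (an assumption, see lead-in).
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars


def text_with_ocr_fallback(
    path: str,
    extract_text: Callable[[str], str],
    ocr_text: Callable[[str], str],
    min_chars: int = 80,
) -> str:
    """Use the embedded text layer if it looks real, otherwise run OCR."""
    text = extract_text(path) or ""
    if looks_scanned(text, min_chars):
        return ocr_text(path)
    return text
```

A threshold like this would not catch the mixed invoices mentioned earlier in the thread, where the body is text and only the header is an image, because the body alone exceeds the limit.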

Maybe this does not need to be solved in invoice2data, as pdfminer supports hOCR now: https://github.com/pdfminer/pdfminer.six/pull/651

Maybe we need to update the documentation on how to use it.

bosd avatar Aug 26 '22 07:08 bosd