invoice2data
Using the tesseract4 option when invoice2data is used as a Python library?
How can I use the tesseract option when using invoice2data as a library?
Hey @erkin98,
I haven't tried anything except pdftotext, but you should be able to specify it as the third argument to the extract_data function, which is defined as:
extract_data(invoicefile, templates=None, input_module=pdftotext)
Docstring shows:
input_module : {'pdftotext', 'pdfminer', 'tesseract'}, optional
library to be used to extract text from given `invoicefile`
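Note that the default in the signature above, `pdftotext`, is unquoted, which suggests a module object rather than a string, even though the docstring lists quoted names. A minimal sketch under that assumption: it imports `invoice2data.input.tesseract` and passes the module itself as `input_module`. The helper names `load_input_module` and `extract_with` are hypothetical, not part of invoice2data, and `invoice.pdf` is a placeholder path.

```python
import importlib


def load_input_module(name="tesseract"):
    """Import invoice2data.input.<name> and return the module object.

    Raises ImportError if invoice2data or the named input module is missing.
    """
    return importlib.import_module(f"invoice2data.input.{name}")


def extract_with(invoicefile, name="tesseract", templates=None):
    """Call extract_data with the chosen input module (hypothetical helper)."""
    # Imported lazily so this file loads even without invoice2data installed.
    from invoice2data import extract_data

    # Assumption: input_module expects the module object, as the default
    # `input_module=pdftotext` in the signature suggests.
    return extract_data(
        invoicefile,
        templates=templates,
        input_module=load_input_module(name),
    )


if __name__ == "__main__":
    # Placeholder file name; replace with a real invoice.
    print(extract_with("invoice.pdf"))
```

If your installed version accepts the string form shown in the docstring, passing `input_module="tesseract"` directly may work as well; I have not verified that across versions.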
I've tried that and it does not work. Has anyone managed to use it?
I use tesseract as an input module in:
https://github.com/OCA/edi/pull/567
I tried tesseract4 from the command line but got an error:
convert-im6.q16: unable to open image `/tmp/tmp_a3u7owr.tiff': No such file or directory @ error/blob.c/OpenBlob/2874.
convert-im6.q16: no images defined `tiff:-' @ error/convert.c/ConvertImageCommand/3258.
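The messages above come from ImageMagick's `convert`, so the input module evidently shells out to external tools. A first sanity check, offered only as a sketch: confirm the executables are on the PATH before digging further. That `tesseract` is also invoked as an external command is an assumption here, and `missing_tools` is a hypothetical helper. This check does not explain the missing temporary .tiff file by itself; it only rules out one class of causes.

```python
import shutil


def missing_tools(tools=("convert", "tesseract")):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("convert and tesseract are both on PATH")
```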
Note: the title of this issue mentions tesseract4, but the post mentions tesseract.
@erkin98 @Carlos314159 New PR in: https://github.com/OCA/edi/pull/722 to fix/restore this functionality. Would you be so kind to review? :pray:
Edit: Oops, was not paying attention. Thought I was posting this in the odoo repo :astonished:
Anyway, the tesseract input module has recently been refactored. Tesseract 4 is now the default, and languages are detected automatically.
It can now be used on image files. Currently there is one issue with parsing PDF files, but a hotfix is on its way in https://github.com/invoice-x/invoice2data/pull/468
Although my previous post was in the wrong repo, it might be useful for some of you as a code example.
You can see a live test at that PR:
- On GH Actions (click on "show all checks"), click on runboat/build (details).
- Click on Start (wait a couple of minutes until the background turns green).
- Click on Live.
- Log in with user: admin, password: admin.
- Go to Invoicing --> Vendors --> Import vendor bills.
Closing this one as completed. Feel free to reopen.