invoice2data
Add testing for Tesseract module. Including test PDFs or images
I've been trying out the Tesseract input module, but I keep getting this particular error. I see it go through page 1 and then page 2, and then it throws the following:
Traceback (most recent call last):
  File "/usr/bin/invoice2data", line 11, in <module>
    load_entry_point('invoice2data==0.2.81', 'console_scripts', 'invoice2data')()
  File "/usr/lib/python3.6/site-packages/invoice2data/main.py", line 117, in main
    res = extract_data(f.name, templates=templates, input_module=input_module)
  File "/usr/lib/python3.6/site-packages/invoice2data/main.py", line 43, in extract_data
    extracted_str = input_module.to_text(invoicefile).decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'
I'm not quite sure what the error means in this context. I tried converting the PDF into a TIFF myself, as defined in the tesseract.py file, and then running tesseract standalone. tesseract creates output as expected, so I believe the problem is in the application and not in tesseract.
Quick update on this: I downloaded the source and removed .decode('utf-8') on line 13 in tesseract.py, because it seems that main.py already does this, and now it runs fine. I don't usually work with Python, but I'll put in a PR so you can look at what I did. I'm still not sure whether it's something on my end or in the framework.
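The traceback above shows .decode('utf-8') being called on a value that is already a str, which suggests the text was decoded twice (once in tesseract.py and once in main.py). A minimal sketch of a defensive alternative follows; ensure_text is a hypothetical helper for illustration, not part of invoice2data, and it assumes the input module may return either bytes or str:

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as str, decoding only when it is still bytes.

    Hypothetical helper (not an invoice2data function): it tolerates an
    input module that returns either raw bytes or already-decoded text.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Both kinds of return value end up as the same str.
print(ensure_text(b"Total: 101.00"))
print(ensure_text("Total: 101.00"))
```

Decoding in exactly one place is the simpler fix, which is what removing the extra call does; a guard like this only helps if different input modules cannot agree on a return type.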
I had to do the exact same thing. Now it works in Python 2.7 and Python 3.5.
Good find @RobertLemmens . I believe someone only changed this recently. I'll check why this wasn't covered by tests.
Merged your PR. The Tesseract module is not very frequently used and we don't test it yet. Hoping to see some improvements here over the summer.
.decode('utf-8') was added to fix the problems which occur in Windows. See #32 and #99.
The Tesseract module is not working well. There were simple issues, such as $101.00 being read as $101. 00, so the template reports that some fields do not match. There is therefore not much point in writing tests at this point (almost zero usability). I was reading more about Tesseract, and they released v4.0 (beta) on 28 March 2018; I hope it works better.
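One possible workaround for the stray space after the decimal point is to normalise the OCR text before matching fields. This is only a sketch under assumptions: join_split_decimals is a hypothetical helper, not an invoice2data function, and the pattern assumes amounts with exactly two decimal digits:

```python
import re

# A digit and a dot, then whitespace, then exactly two digits at a word boundary.
_SPLIT_DECIMAL = re.compile(r"(\d\.)\s+(\d{2})\b")


def join_split_decimals(text):
    """Rejoin amounts that OCR split after the decimal point.

    Hypothetical helper: turns '$101. 00' back into '$101.00'.
    """
    return _SPLIT_DECIMAL.sub(r"\1\2", text)


print(join_split_decimals("Total due: $101. 00"))
```

The pattern is deliberately narrow, but it can still misfire on text such as a sentence ending in a digit followed by a two-digit number, so it would need checking against real invoices.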
Since this issue was opened, a lot of work has been done on both Tesseract and invoice2data. Right now it is possible to get some great, usable results.
However, the results are not 100% consistent. Maybe we should reconsider this one and add a non-blocking unit test, if that is possible.
Any pointers on how to do that?
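One way to keep such a test non-blocking is to skip it whenever the tesseract binary is missing, so the rest of the suite still passes on machines without OCR installed. The sketch below uses only the standard library; the class name, the run_suite helper, and the smoke check are assumptions for illustration, not existing invoice2data tests:

```python
import shutil
import subprocess
import unittest

# Path to the tesseract executable, or None when it is not installed.
TESSERACT = shutil.which("tesseract")


@unittest.skipUnless(TESSERACT, "tesseract binary not found on PATH")
class TesseractSmokeTest(unittest.TestCase):
    """Hypothetical smoke test: reported as skipped, not failed, without tesseract."""

    def test_binary_runs(self):
        # Only checks that the binary starts; a real test would OCR a sample invoice.
        proc = subprocess.run(
            [TESSERACT, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        self.assertEqual(proc.returncode, 0)


def run_suite():
    """Run the smoke test and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TesseractSmokeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    result = run_suite()
    print("skipped:", len(result.skipped), "failures:", len(result.failures))
```

Because OCR output is not fully consistent, a real test would probably assert on a few robust fields rather than on the complete extracted text.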
Just pinging @rmilecki to bring this issue to your attention.
Sounds good, of course (to add more tests). I don't have any spare time to work on Tesseract though; it's outside my daily usage, sorry.