invoice2data

Add testing for Tesseract module, including test PDFs or images

Open RobertLemmens opened this issue 6 years ago • 8 comments

I've been trying out the Tesseract input module, but I keep getting this particular error. I see it go through page 1 and then page 2, and then it throws the following:

Traceback (most recent call last):
  File "/usr/bin/invoice2data", line 11, in <module>
    load_entry_point('invoice2data==0.2.81', 'console_scripts', 'invoice2data')()
  File "/usr/lib/python3.6/site-packages/invoice2data/main.py", line 117, in main
    res = extract_data(f.name, templates=templates, input_module=input_module)
  File "/usr/lib/python3.6/site-packages/invoice2data/main.py", line 43, in extract_data
    extracted_str = input_module.to_text(invoicefile).decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'

I'm not quite sure what the error means in this context. I tried converting the PDF into a TIFF myself, as defined in the tesseract.py file, and then running tesseract standalone. Tesseract creates output as expected, so I believe the problem is in the application and not in tesseract.

RobertLemmens avatar Apr 18 '18 04:04 RobertLemmens

Quick update on this: I downloaded the source and removed .decode('utf-8') on line 13 in tesseract.py, because it seems that main.py already does this, and now it runs fine. I don't usually work with Python, but I'll put in a PR so you can look at what I did. I'm still not sure if it's something on my end or with the framework.
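For anyone hitting the same traceback, here is a minimal sketch of why decoding twice fails on Python 3. The `ensure_text` helper is hypothetical and not part of invoice2data; it only illustrates one way to accept both `bytes` and `str` without raising.

```python
def ensure_text(value, encoding="utf-8"):
    """Hypothetical helper: return `value` as str, decoding only if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Simulate an input module that already decoded its output.
raw = "Invoice total: 101.00".encode("utf-8")
decoded_once = raw.decode("utf-8")  # now a str

# Decoding a second time reproduces the AttributeError from the traceback.
try:
    decoded_once.decode("utf-8")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
print(ensure_text(raw) == ensure_text(decoded_once))
```

On Python 3 a `str` has no `decode` method, so whichever layer decodes second raises the error. Decoding in exactly one place, or guarding with a type check as above, avoids it.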

RobertLemmens avatar Apr 18 '18 15:04 RobertLemmens

I had to do the exact same thing; now it works in Python 2.7 and Python 3.5.

blade3609 avatar Apr 20 '18 15:04 blade3609

Good find, @RobertLemmens. I believe someone changed this only recently. I'll check why this wasn't covered by tests.

m3nu avatar Apr 20 '18 23:04 m3nu

Merged your PR. The Tesseract module is not very frequently used and we don't test it yet. Hoping to see some improvements here over the summer.

m3nu avatar Apr 20 '18 23:04 m3nu

.decode('utf-8') was added to fix problems that occur on Windows. See #32 and #99.

duskybomb avatar May 05 '18 21:05 duskybomb

The Tesseract module is not working well. There were simple issues, such as $101.00 being read as $101. 00, which causes some fields not to match. So there is not much point in writing tests at this stage (almost zero usability). I was reading more about Tesseract, and they released v4.0 (beta) on 28 March 2018; I hope it works better.
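To illustrate the kind of mismatch described above, here is a rough sketch of a post-processing step that closes the gap OCR sometimes inserts after a decimal point. The function name and the regular expression are assumptions made for illustration, not something invoice2data provides, and the heuristic could misfire on other text.

```python
import re

# Hypothetical heuristic: a period preceded by a digit, followed by spaces
# and then exactly two digits, is treated as a split decimal amount.
_SPLIT_DECIMAL = re.compile(r"(?<=\d)\.[ \t]+(?=\d{2}\b)")


def join_split_decimals(text):
    """Collapse whitespace that OCR inserted after a decimal point."""
    return _SPLIT_DECIMAL.sub(".", text)


print(join_split_decimals("Total: $101. 00"))
```

Running something like this on the OCR output before template matching might let a field regex that expects `101.00` match again, at the cost of occasionally joining numbers that really were separate.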

duskybomb avatar May 13 '18 17:05 duskybomb

Since this issue was opened, a lot of work has been done on both tesseract and invoice2data. Right now it is possible to get some great, usable results.

However, the results are not 100% consistent. Maybe we should reconsider this one and add a non-blocking unit test (if that is possible).

Any pointers on how to do that?
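One possible shape for a non-blocking test, sketched with the standard `unittest` module: skip when the `tesseract` binary is missing, and downgrade an OCR mismatch to a skip instead of a failure. Everything here is an assumption for illustration; `fake_ocr_output` stands in for a real call to the Tesseract input module on a sample file.

```python
import shutil
import unittest

# True only if a `tesseract` executable is found on PATH.
TESSERACT_AVAILABLE = shutil.which("tesseract") is not None


def fake_ocr_output():
    """Hypothetical stand-in for running OCR on a sample invoice."""
    return "Invoice total: $101. 00"


class TesseractSmokeTest(unittest.TestCase):
    @unittest.skipUnless(TESSERACT_AVAILABLE, "tesseract binary not installed")
    def test_total_is_read(self):
        text = fake_ocr_output()
        if "$101.00" not in text:
            # Report the inconsistency without failing the whole run.
            self.skipTest("OCR output differs from expectation: %r" % text)
        self.assertIn("$101.00", text)


def run_suite():
    """Run the smoke test and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TesseractSmokeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    result = run_suite()
    print("skipped:", len(result.skipped), "ok:", result.wasSuccessful())
```

With pytest, marking the test with `pytest.mark.xfail(strict=False)` is another option: an inconsistent OCR result is then reported as an expected failure rather than breaking the build.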

Just pinging @rmilecki to bring this issue to your attention.

bosd avatar Feb 06 '23 12:02 bosd

Sounds good, of course (to add more tests). I don't have any spare time to work on tesseract though; it's outside my daily usage, sorry.

rmilecki avatar Feb 18 '23 11:02 rmilecki