Input with Tesseract and multiple pages
There may be an issue with PDF files that have more than one page.
If the PDF file contains multiple pages, I always get the result of only one page. So I changed the Tesseract input: I added "pyPDF" (to get the number of pages) and ran a convert and tesseract command for every single page to get the text.
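For anyone who wants to try the per-page approach described above, here is a minimal sketch. It is not the original poster's code: the function names are hypothetical, the `pypdf` package stands in for "pyPDF", and the `-density`/`-depth` values are illustrative assumptions. It relies on ImageMagick's `file.pdf[N]` suffix to select a single page and on Tesseract reading from `stdin` and writing to `stdout`.

```python
import subprocess


def convert_cmd(pdf_path, page_index):
    # ImageMagick picks one page with the "file.pdf[N]" suffix (0-based).
    # Density and depth are illustrative values, not tuned settings.
    return ["convert", "-density", "350", f"{pdf_path}[{page_index}]",
            "-depth", "8", "png:-"]


def tesseract_cmd():
    # Read the image from stdin, write the recognised text to stdout.
    return ["tesseract", "stdin", "stdout"]


def count_pages(pdf_path):
    # Assumption: the third-party "pypdf" package is installed.
    from pypdf import PdfReader
    return len(PdfReader(pdf_path).pages)


def ocr_pdf(pdf_path, num_pages=None):
    # Run convert + tesseract once per page and join the page texts.
    if num_pages is None:
        num_pages = count_pages(pdf_path)
    texts = []
    for index in range(num_pages):
        image = subprocess.run(convert_cmd(pdf_path, index),
                               check=True, capture_output=True).stdout
        result = subprocess.run(tesseract_cmd(), input=image,
                                check=True, capture_output=True)
        texts.append(result.stdout.decode("utf-8"))
    return "\n".join(texts)
```

Calling `ocr_pdf("invoice.pdf")` would then return the text of all pages in order, provided `convert` and `tesseract` are on the PATH.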
For better accuracy, I also added an optional parameter that passes Tesseract the path to a custom word-list file (Tesseract takes this via the --user-words parameter). In that file I listed all the known company names, postal and email addresses, phone numbers and website URLs. With this word-list file in place, the text result is much better.
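The word-list idea could be wired in roughly as follows. This is only a sketch under assumptions: the helper names are hypothetical, and the file layout (one entry per line) is what I would expect Tesseract to accept, not something confirmed in this thread. Only the `--user-words` option itself comes from the post above.

```python
def format_word_list(words):
    # One entry per line, stripped, with blanks and duplicates dropped
    # while keeping the original order.
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            lines.append(word)
    return "\n".join(lines) + "\n"


def tesseract_cmd_with_words(user_words_path=None):
    # Base command: image on stdin, text on stdout.
    cmd = ["tesseract", "stdin", "stdout"]
    # Only append the option when a word-list path was given,
    # so the parameter stays optional.
    if user_words_path:
        cmd += ["--user-words", str(user_words_path)]
    return cmd
```

The text from `format_word_list` would be written to a file once, and that file's path passed to `tesseract_cmd_with_words` for every OCR run.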
I'm now getting very good results with Tesseract and it's working great.
Maybe this will help someone?
Best regards, maisen
We assume that 1 PDF file = 1 invoice. Any separation would need to be done before running invoice2data.
Apart from that, a PDF can have multiple pages and the extraction will run on the combined text result. E.g. this sample file has multiple pages.
Good idea to improve Tesseract. One could probably subclass the Tesseract extraction class and add one's optimizations there.
@maisen20 can you share your code or create a pull request?
@thenaturalist sorry, I'm very busy with other projects and will soon be on holiday.
But maybe I can help you. Where do you need help?
@maisen20 I would like to reproduce what you are describing. I guess the most efficient way to share knowledge on this is via code? As we both speak German: what do you think about talking or screen sharing on this for 10 to 15 minutes, after which I could create a pull request?
You can find my email address on the website via my GitHub profile to set something up.
I'm currently having this issue where only the first page is read. Is the PR you mentioned on the way?
@maisen20 @thenaturalist any update on this?