invoice2data
How would you parse this line? (bol.com)
Template:
Omschrijving Aantal Prijs/st Korting Bedrag BTW% BTW
Double A printpapier - A4 - 1
1 \u20ac 22,50 \u20ac 22,50 21% \u20ac 3,90
DOOS - 5 pakken x 500 vel
Subtotaal ex. BTW \u20ac 18,60
21% BTW \u20ac 3,90
Bedrag incl. BTW \u20ac 22,50
Totaalbedrag \u20ac 22,50
The mess above is the line. I can grab everything except the description of the product using this:
\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})\s+\u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})\s+\u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+\u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?
I tried capturing the first line of the description using this:
(?P<description>.+)[\r\n]\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})\s+\u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})\s+\u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+\u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?
However, with that pattern no lines are found at all anymore.
Is it possible to capture the description and amounts at the same time? Or how would I approach this situation?
I'm facing the same issue: the product's description lies on 2 rows, but there's nothing on the row between these two. How can you capture the description in this case?
I think the problem is that the parser doesn't support line breaks in the regex. It would be a great start to at least have the first line of the description.
If somebody knows a workaround or fix, it's highly appreciated.
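For anyone who wants to check the line-break theory outside the module, here is a minimal sketch. It assumes the parser splits the extracted text on newlines and tests the regex against each line separately; that is what the comment above suggests, but it has not been verified against the invoice2data source. The pattern is a shortened version of the one in the first post (with `\s*` after the line break, because the sample text here has no leading spaces), and both function names are made up for the demo.

```python
import re

# Extracted text as posted at the top of the thread, euro sign written out.
TEXT = (
    "Omschrijving Aantal Prijs/st Korting Bedrag BTW% BTW\n"
    "Double A printpapier - A4 - 1\n"
    "1 € 22,50 € 22,50 21% € 3,90\n"
    "DOOS - 5 pakken x 500 vel\n"
    "Subtotaal ex. BTW € 18,60\n"
)

# Shortened pattern: first description line, a line break, then the amounts.
PATTERN = re.compile(
    r"(?P<description>.+)[\r\n]\s*"
    r"(?P<amount_nr>\d+)\s+€\s+(?P<amount_piece>\d+,\d{2})"
)


def match_whole_text(text):
    """Search the full text at once, so the pattern may cross a line break."""
    m = PATTERN.search(text)
    return m.groupdict() if m else None


def match_per_line(text):
    """Assumed parser behaviour: each line is tested on its own."""
    hits = []
    for line in text.splitlines():
        m = PATTERN.search(line)
        if m:
            hits.append(m.groupdict())
    return hits


if __name__ == "__main__":
    print(match_whole_text(TEXT))  # finds the description and amounts
    print(match_per_line(TEXT))    # empty: no single line contains a line break
```

If the module really works line by line, a pattern that contains `[\r\n]` can never match, no matter how correct it is on regex101.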
Struggling with the same issue on aliexpress invoices. Can you share your bol.com template?
This issue has not been solved yet. The code you're asking for is in the first post; apart from that it's just a default template.
As stated earlier, when line breaks are supported it should start to work, but someone should program that.
I just tested the code with the description on regex101.com and got an error on the \u parts. Replacing them with a . seems to work. It catches the description partially.
(This is where I'm at on the AliExpress invoices, 80/20 rule.)
I'm running into the limitations of the debug website, as the module handles multi-line text differently. I will look into this when I have access to an install.
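A note on the \u error: regex101 defaults to the PCRE flavour, which as far as I know rejects `\u` escapes, while Python's `re` module accepts `\uXXXX` in string patterns. The sketch below assumes the `\u20ac` in the posted text is just an escaped euro sign from the debug output, and compares the escape with the `.` workaround. The helper name `amounts` is made up for the demo.

```python
import re

# Python's re understands \uXXXX, so this matches a real euro sign.
EURO_ESCAPED = re.compile(r"\u20ac\s+(?P<amount>\d+,\d{2})")

# The workaround from the comment above: any single character instead of the euro sign.
EURO_DOT = re.compile(r".\s+(?P<amount>\d+,\d{2})")


def amounts(pattern, line):
    """Return every amount the pattern finds in one line."""
    return [m.group("amount") for m in pattern.finditer(line)]


LINE = "1 € 22,50 € 22,50 21% € 3,90"

if __name__ == "__main__":
    print(amounts(EURO_ESCAPED, LINE))
    print(amounts(EURO_DOT, LINE))
    # The dot is looser: it also accepts amounts that have no euro sign in front.
    print(amounts(EURO_DOT, "x 12,00"))
    print(amounts(EURO_ESCAPED, "x 12,00"))
```

Both patterns find the same amounts on the invoice line, but the `.` version will also match in places where no euro sign is present, so it is worth switching back to the escape once the pattern runs in Python.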
Have you tried replacing all the line breaks? I've had some luck with that on gas station invoices. It seems to do the replacement before the text goes through the parser.
The parser spreads the actual description over multiple lines, so the output looks like:
Double A printpapier - A4 - 1
1 \u20ac 22,50 \u20ac 22,50 21% \u20ac 3,90
DOOS - 5 pakken x 500 vel
Which makes it impossible to extract:
Double A printpapier - A4 - 1 DOOS - 5 pakken x 500 vel
On my invoices, replacing the line breaks turned it into something like:
Double A printpapier - A4 - 1 DOOS - 5 pakken x 500 vel 1 \u20ac 22,50 \u20ac 22,50 21% \u20ac 3,90
I used this code to replace the line breaks:
options:
  currency: EUR
  languages:
    - nl
  decimal_separator: ','
  replace:
    - ['\n', '']
Forget my previous statement about removing line breaks. I am still learning this module as well. The best bet is to use the lines plugin: you tell it where to start and stop, and what the first line and the follow-up lines look like. It's still kind of hard to debug without the original invoice file. (Did you post the extracted or the optimized string?)
Try something like:
lines:
  start: Omschrijving
  end: Subtotaal ex
  first_line: (?P<description>\w+(?:\S|[ ]\w\w+|\n)*)[\n]?\s+\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})\s+.u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})\s+.u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+.u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?
  line: '^(?P<description>.+)$'
or
  line: '^(?P<description>\w+(?:\S|[ ]\w\w+|\n)*)$'
It might still need some work on the description part.
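To make the start / end / first line / follow-up line idea concrete, here is a standalone Python sketch that works on the extracted text directly. It is not invoice2data code: `parse_items` and `AMOUNTS` are hypothetical names, and the heuristic assumes every item has exactly one description line above its amounts row and at most one below it, as in the bol.com sample.

```python
import re

TEXT = (
    "Omschrijving Aantal Prijs/st Korting Bedrag BTW% BTW\n"
    "Double A printpapier - A4 - 1\n"
    "1 € 22,50 € 22,50 21% € 3,90\n"
    "DOOS - 5 pakken x 500 vel\n"
    "Subtotaal ex. BTW € 18,60\n"
)

# Amounts row: quantity, unit price, total, and optionally tax rate and tax.
AMOUNTS = re.compile(
    r"^\s*(?P<amount_nr>\d+)\s+€\s+(?P<amount_piece>\d+,\d{2})"
    r"\s+€\s+(?P<amount_total_incl>\d+,\d{2})"
    r"(?:\s+(?P<tax_perc>\d+%)\s+€\s+(?P<amount_tax>\d+,\d{2}))?\s*$"
)


def parse_items(text, start="Omschrijving", end="Subtotaal ex"):
    """Hypothetical helper: merge description lines around each amounts row."""
    lines = text.splitlines()
    # Keep only the lines between the markers (StopIteration if one is missing).
    begin = next(i for i, ln in enumerate(lines) if ln.startswith(start)) + 1
    stop = next(i for i, ln in enumerate(lines)
                if i >= begin and ln.startswith(end))
    block = [ln.strip() for ln in lines[begin:stop] if ln.strip()]
    matches = [AMOUNTS.match(ln) for ln in block]

    items = []
    for i, m in enumerate(matches):
        if m is None:
            continue
        parts = []
        # The line directly above the amounts row starts the description.
        if i > 0 and matches[i - 1] is None:
            parts.append(block[i - 1])
        # The line below continues it, unless it sits directly above the
        # next amounts row and therefore belongs to the next item.
        nxt = i + 1
        if nxt < len(block) and matches[nxt] is None:
            next_is_amounts = nxt + 1 < len(block) and matches[nxt + 1] is not None
            if not next_is_amounts:
                parts.append(block[nxt])
        item = m.groupdict()
        item["description"] = " ".join(parts)
        items.append(item)
    return items


if __name__ == "__main__":
    for item in parse_items(TEXT):
        print(item)
```

On the sample text this joins the two halves into "Double A printpapier - A4 - 1 DOOS - 5 pakken x 500 vel". With descriptions longer than two lines the heuristic cannot tell where one item ends and the next begins, so treat it as a starting point only.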
Like I said, the technical code is at the top: the extracted string from the PDF first, my attempt at a regex second. My regex for multiple lines works fine; it's just that this program apparently can't deal with such a regex. The solution is not in the template, it's in fixing the source code.
Sorry, but without the template and the input file, I am unable to help.
Just to be clear: where did you get this text from? It does not look like the original human-readable PDF.
Omschrijving Aantal Prijs/st Korting Bedrag BTW% BTW
Double A printpapier - A4 - 1
1 \u20ac 22,50 \u20ac 22,50 21% \u20ac 3,90
DOOS - 5 pakken x 500 vel
Subtotaal ex. BTW \u20ac 18,60
21% BTW \u20ac 3,90
Bedrag incl. BTW \u20ac 22,50
Totaalbedrag \u20ac 22,50
As weird as it may sound, in my experience working with this module the debug window shows different strings. I've had similar data to the above, but when I changed the template the parser handled the PDF document differently. It would be easier if you posted the original PDF file (albeit anonymized). This module is quite capable of handling multi-line texts, but it does require some fiddling around with the options for line extraction and tables.
Example of multiline extraction:
PDF: https://github.com/invoice-x/invoice2data/blob/master/tests/compare/QualityHosting.pdf
Template: https://github.com/invoice-x/invoice2data/blob/master/src/invoice2data/extract/templates/de/de.qualityhosting.yml
Output: https://github.com/invoice-x/invoice2data/blob/master/tests/compare/QualityHosting.json
Oddly, with your regex code I do get pattern errors.
There is really no easy or clean way to parse such lines. The problem is the vertical alignment of the table cells' content.
Ideally we should ask pdftotext to vertically align every table cell to the top. That isn't easy, however, as PDFs in general don't have a concept of tables, so it's hard for pdftotext to detect table cells and handle them according to some extra request.