invoice2data

How would you parse this line? (bol.com)

Open MrMoronIV opened this issue 3 years ago • 10 comments

Template:

Omschrijving                    Aantal     Prijs/st   Korting     Bedrag     BTW%               BTW


Double A printpapier - A4 - 1
                                    1    \u20ac 22,50                \u20ac 22,50        21%         \u20ac     3,90
DOOS - 5 pakken x 500 vel




                                              Subtotaal ex. BTW                        \u20ac       18,60
                                              21% BTW                                  \u20ac       3,90
                                              Bedrag incl. BTW                         \u20ac       22,50


                                              Totaalbedrag                             \u20ac       22,50

The mess above is the line. I can grab everything except the product description using this:

\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})\s+\u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})\s+\u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+\u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?

I tried capturing the first line of the description using this:

(?P<description>.+)[\r\n]\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})\s+\u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})\s+\u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+\u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?

However, with that regex no lines are found at all anymore.

Is it possible to capture the description and amounts at the same time? Or how would I approach this situation?

MrMoronIV, Oct 01 '21 10:10

I'm facing the same issue: the product's description is spread over 2 rows, but there's nothing on the row between these two. How can you capture the description in this case?

sergiuturus, Oct 05 '21 06:10

I think the problem is that the parser doesn't support line breaks in the regex. It would be a great start to at least have the first line of the description.
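To make that suspicion concrete, here is a minimal Python sketch. It assumes (this is not confirmed from the invoice2data source) that the parser applies the regex to one line at a time. The regex is a shortened version of the one in the first post, and the euro sign is written out instead of \u20ac to keep the example readable.

```python
import re

# The extracted item text from the first post, with the euro sign written out.
text = (
    "Double A printpapier - A4 - 1\n"
    "                                    1    € 22,50                € 22,50        21%         €     3,90\n"
    "DOOS - 5 pakken x 500 vel\n"
)

# Shortened version of the two-line regex: description, line break, first amounts.
pattern = re.compile(
    r"(?P<description>.+)[\r\n]\s+(?P<amount_nr>\d+)\s+€\s+(?P<amount_piece>\d+,\d{2})"
)

# Searching the whole block: the line break is present, so the regex can match.
whole = pattern.search(text)

# Searching one line at a time (ASSUMED parser behaviour): no single line
# contains a line break, so the [\r\n] part can never match.
per_line = [pattern.search(line) for line in text.splitlines()]

print(whole.group("description") if whole else None)
print(per_line)
```

If the assumption holds, the regex itself is fine and only the line-by-line matching keeps it from finding anything.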

If somebody knows a workaround or fix, it's highly appreciated.

MrMoronIV, Oct 08 '21 04:10

Struggling with the same issue on AliExpress invoices. Can you share your bol.com template?

bosd, Jan 27 '22 12:01

This issue has not been solved yet. The code you're asking for is in the first post; otherwise it's just a default template.

As stated earlier, it should start to work once line breaks are supported, but someone has to program that.

MrMoronIV, Jan 28 '22 07:01

Just tested the regex with the description on regex101.com. Got an error on the \u parts. Replacing them with a . seems to work. It catches the description partially.

(This is where I'm at on AliExpress invoices: the 80/20 rule.)

I'm running into the limitations of the debug website. Will look into this when I have access to an install, as the module handles multi-line text differently.
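On the \u parts, a small Python sketch of what each variant matches. Python's re module reads \u20ac in a string pattern as the euro character, which may explain why a site set to a different regex flavor rejects it. It is not clear from this thread whether the extracted text holds a real euro sign or the six literal characters \u20ac, so both cases are checked; the sample strings are made up for illustration.

```python
import re

# Python's re interprets \u20ac in a str pattern as the euro character.
escaped_pattern = re.compile(r"\u20ac\s+(?P<amount>\d+,\d{2})")
# The workaround from regex101: any one character followed by the text "u20ac".
dotted_pattern = re.compile(r".u20ac\s+(?P<amount>\d+,\d{2})")

real_sign = "1    € 22,50"          # text contains the actual euro character
literal_seq = "1    \\u20ac 22,50"  # text contains backslash, u, 2, 0, a, c

results = {
    ("escaped", "real"): bool(escaped_pattern.search(real_sign)),
    ("escaped", "literal"): bool(escaped_pattern.search(literal_seq)),
    ("dotted", "real"): bool(dotted_pattern.search(real_sign)),
    ("dotted", "literal"): bool(dotted_pattern.search(literal_seq)),
}

for (pattern_name, text_name), matched in results.items():
    print(f"{pattern_name} pattern on {text_name} text: {matched}")
```

So a pattern that works on the debug website does not have to work inside the module, and the other way around: each variant only matches one of the two kinds of text.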

bosd, Jan 28 '22 08:01

Have you tried replacing all the line breaks? I've had some luck with that on gas station invoices. It seems to do the replacement before the text goes through the parser.

The parser spreads the actual description across multiple lines, so the output looks like:

Double A printpapier - A4 - 1
                                    1    \u20ac 22,50                \u20ac 22,50        21%         \u20ac     3,90
DOOS - 5 pakken x 500 vel

Which makes it impossible to extract:

Double A printpapier - A4 - 1 DOOS - 5 pakken x 500 vel

On my invoices, replacing the line breaks turned the text into something like:

Double A printpapier - A4 - 1 DOOS - 5 pakken x 500 vel  1    \u20ac 22,50                \u20ac 22,50        21%         \u20ac     3,90

I used this code to replace the line breaks:

options:
  currency: EUR
  languages:
    - nl
  decimal_separator: ','
  replace:
    - ['\n' ,'']
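For reference, a rough Python sketch of what such a replace pair would do, assuming each pair is applied as a regex substitution over the whole extracted text before parsing (an assumption about the module, not checked against its source). The column spacing in the sample is shortened and the euro sign is written out.

```python
import re

# Item text from the first post, with shortened column spacing.
extracted = (
    "Double A printpapier - A4 - 1\n"
    "    1    € 22,50    € 22,50    21%    € 3,90\n"
    "DOOS - 5 pakken x 500 vel\n"
)

# Equivalent of the pair ['\n', '']: delete every line break.
flattened = re.sub(r"\n", "", extracted)

print(flattened)
```

With the bol.com text the original order is kept, so the second half of the description ends up after the amounts instead of next to the first half.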

bosd, Jan 29 '22 19:01

Forget my previous statement about removing line breaks; I am still learning this module as well. The best bet is to use the lines plugin, telling it where to start and stop, and what the first line and follow-up lines look like. It's still kind of hard to debug without the original invoice file. (Did you post the extracted or the optimized string?)

Try something like:

lines:
    start: Omschrijving
    end: Subtotaal ex
    first_line:  (?P<description>\w+(?:\S|[ ]\w\w+|\n)*)[\n]?\s+\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})\s+.u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})\s+.u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+.u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?
    line:  '^(?P<description>.+)$'

or line: '^(?P<description>\w+(?:\S|[ ]\w\w+|\n)*)$'

Might still need some work on the description part.
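The start / end / line-by-line idea can also be tried outside the template. The following is a standalone Python sketch, not the lines plugin itself: parse_items, AMOUNTS and the rule "a blank line ends an item" are all hypothetical, the amount regex is simplified, and the euro sign is written out. The blank-line rule fits the single sample in this thread but is unverified for invoices with several items.

```python
import re

# Simplified amounts row: quantity, unit price, total, optional tax columns.
AMOUNTS = re.compile(
    r"^\s+(?P<amount_nr>\d+)\s+€\s+(?P<amount_piece>\d+,\d{2})"
    r"\s+€\s+(?P<amount_total_incl>\d+,\d{2})"
    r"(?:\s+(?P<tax_perc>\d+%)\s+€\s+(?P<amount_tax>\d+,\d{2}))?"
)


def parse_items(text, start="Omschrijving", end="Subtotaal ex"):
    """Group non-blank lines between start and end into items (sketch)."""
    items, block, inside = [], [], False

    def flush():
        # A block counts as an item only if one of its lines is an amounts row.
        amounts = [m for m in map(AMOUNTS.match, block) if m]
        if amounts:
            item = amounts[0].groupdict()
            # Every other line in the block is treated as description text.
            item["description"] = " ".join(
                line.strip() for line in block if not AMOUNTS.match(line)
            )
            items.append(item)
        block.clear()

    for line in text.splitlines():
        if not inside:
            inside = start in line  # skip everything up to the header row
            continue
        if end in line:
            break
        if line.strip():
            block.append(line)
        else:
            flush()  # ASSUMPTION: a blank line separates items
    flush()
    return items


sample = """Omschrijving    Aantal  Prijs/st  Korting  Bedrag  BTW%  BTW


Double A printpapier - A4 - 1
        1    € 22,50        € 22,50    21%    €  3,90
DOOS - 5 pakken x 500 vel



      Subtotaal ex. BTW      €   18,60
"""

items = parse_items(sample)
print(items)
```

Because the description lines are collected around the amounts row rather than matched in the same regex, no pattern has to span a line break.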

bosd, Feb 01 '22 07:02

Like I said, the technical code is at the top: the extracted string from the PDF first, my attempt at a regex second. My regex for multiple lines works fine; it's just that this program apparently can't deal with such a regex. The solution is not in the template, it's in fixing the source code.

MrMoronIV, Feb 01 '22 07:02

Sorry, but without the template and input file, I am unable to help.

Just to be clear: where did you get this code from? It does not look like the original human-readable PDF:

Omschrijving                    Aantal     Prijs/st   Korting     Bedrag     BTW%               BTW


Double A printpapier - A4 - 1
                                    1    \u20ac 22,50                \u20ac 22,50        21%         \u20ac     3,90
DOOS - 5 pakken x 500 vel




                                              Subtotaal ex. BTW                        \u20ac       18,60
                                              21% BTW                                  \u20ac       3,90
                                              Bedrag incl. BTW                         \u20ac       22,50


                                              Totaalbedrag                             \u20ac       22,50

As weird as it may sound, in my experience working with this module the debug window shows different strings. I've had similar data to the above, but when I changed the template, the parser handled the PDF document differently. It would be easier if you posted the original PDF file (albeit anonymized). This module is quite capable of handling multiline texts, but it does require some fiddling with the options for line extraction and tables.

Example of multiline extraction:
PDF: https://github.com/invoice-x/invoice2data/blob/master/tests/compare/QualityHosting.pdf
Template: https://github.com/invoice-x/invoice2data/blob/master/src/invoice2data/extract/templates/de/de.qualityhosting.yml
Output: https://github.com/invoice-x/invoice2data/blob/master/tests/compare/QualityHosting.json

Oddly, with your regex code I do get pattern errors.

bosd, Feb 01 '22 14:02

There is really no easy/clean way to parse such lines. The problem is the vertical alignment of the table cells' content.

Ideally we should ask pdftotext to vertically align every table cell to the top. That isn't easy, however, as PDFs in general don't have a concept of tables. So it's hard for pdftotext to detect table cells and handle them according to some extra requests.

rmilecki, Aug 06 '23 17:08