
parsers: lines: support "rules" for multiple sets of regexes

Open rmilecki opened this issue 3 years ago • 4 comments

templates: make Amazon YAML use "rules" for "lines"

This doesn't really change anything and isn't necessary. It does, however,
allow testing the "rules" implementation in the "lines" parser.

lines: support "rules": field for multiple sets of parsing regexes

Sometimes companies use more than 1 format for line-parseable data. They
may e.g. randomly add some extra columns that are used occasionally.

This commit adds "rules" field support to the "lines" parser. It allows
defining multiple sets of regexes ("start", "end", "line" & friends) for
a single field.

Usage of "rules" is optional. Backward compatibility with existing
templates is preserved.
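
A minimal sketch of how optional "rules" can stay backward compatible, assuming a template without "rules" is simply treated as one implicit rule. The helper `normalize_rules` and the `RULE_KEYS` tuple are hypothetical names for illustration, not invoice2data's actual code.

```python
from typing import Any

# Keys that describe one set of parsing regexes. Taken from the commit
# message above; the real parser may accept more ("& friends").
RULE_KEYS = ("start", "end", "line")


def normalize_rules(settings: dict[str, Any]) -> list[dict[str, Any]]:
    """Return a list of rule dicts, whether or not "rules" was used."""
    if "rules" in settings:
        return list(settings["rules"])
    # Old-style template: the regexes sit directly on the field,
    # so wrap them into a single implicit rule.
    return [{key: settings[key] for key in RULE_KEYS if key in settings}]


old_style = {
    "parser": "lines",
    "start": "Barcode",
    "line": r"(?P<qty>\d+)",
    "end": "Netto totaal",
}
new_style = {
    "parser": "lines",
    "rules": [
        {"start": "Barcode", "line": r"(?P<qty>\d+)", "end": "Netto totaal"},
    ],
}

# Both spellings normalize to the same single rule.
same = normalize_rules(old_style) == normalize_rules(new_style)
print(same)
```

With this kind of normalization the rest of a parser only ever has to loop over a list of rules, which is one way existing templates could keep working unchanged.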

rmilecki avatar Sep 25 '22 09:09 rmilecki

This is an alternative implementation of the feature suggested in https://github.com/invoice-x/invoice2data/issues/377

This should resolve: https://github.com/invoice-x/invoice2data/issues/238 https://github.com/invoice-x/invoice2data/issues/377

rmilecki avatar Sep 25 '22 09:09 rmilecki

@bosd: I think this implementation is a bit simpler than the one suggested in https://github.com/invoice-x/invoice2data/issues/377 . I hope my changes are simple to understand & review.

One advantage I see in this approach is that the lines parser focuses on the parser-based syntax (the new syntax).

If we want to support new features in the old syntax, I believe that code should go into the plugins lines code.

rmilecki avatar Sep 25 '22 10:09 rmilecki

To start off: I'm all in, for cleaner and easier to understand code. However, I think this PR does not achieve the same thing.

Edit: added the text input for easier testing.

Mekro B.V. Pagina 1
Moermanstraat 4 Invoicedate : 01-11-2021 16:34
Info: Mekro Service Center 5222 BD 's-Hertogenbosch Printdatum : 01-11-2021
16:35
Telefoon: 0900-2025000
Postbus 159, 2290 AD Wateringen
www.mekro.nl
K.v.K: 33166113 OB nr: NL001799434B01
IBAN: NL44 INGB 0702 5937 02
BIC : INGBNL2A
Invoicenumber: 0/0(057)0004/041503 (004-374170) 057/119
----------------------------------------------------------------------------------------------------
Efficient Invoice Handling Klantnummer : 057 726666 01 01 SC
Sessamestreet 46
5555 NH TommyCity
----------------------------------------------------------------------------------------------------
Stuks per Prijs per Code Prijs st/kg
Barcode                 Description          qty uom         unitprice discount
----------------------------------------------------------------------------------------------------
---FOOD ITEMS---
2231012001992 KROKETBROODJES                             2 KG           1,00 0,0%
8713009019455 Oil                                        3 L           0,50 0,0%
8713009019475 Apple                                      1 KG           50,0 0,0%
---OTHER ITEMS---
8713009019375 programmer                                 1 Hour        50,0 100,0%
0013009019475 Sticker                                    1 pce          0,0 0,0%

Aantal stuks: 7 Netto totaal: 5,50
Excl.BTW Code BTW BTW Totaal
0 1=21,00% 0,10 0.00
0 5= 9,00%           0,0 0,00
------------------------------------------------
53,85        6,05          59,90
----------------------------------------------------------------------------------------------------
To Pay 15,50
POI: 52001324 KLANTTICKET --------------------------------
Terminal: BS111850 Merchant: 9533494654 Period: 1305
Transactie: 00000055 Token: 2004130501564440011 AMERICAN EXPRESS
(A000000022010801) Kaart: 375382xxxxx1000 Kaartserienummer: 0
BETALING Datum: 01/11/2021 16:36 Autorisatiecode: 66
Visit www.americanexpress.nl Total: 5,50 EUR Contact
Leesmethode: CHIP Met PIN gevalideerd
Pin betaling 5,50
------------------------------------------------
Paid 5,50
Test with taxes; changed the "te betalen" (to pay) amount.


I repeated the functional test from https://github.com/invoice-x/invoice2data/pull/378#issuecomment-1168534755

I rewrote the template in the syntax of this PR:


# -*- coding: utf-8 -*-
issuer: Mekro
fields:
  amount: To Pay\s+(\d+.\d{2})
  amount_untaxed: Netto totaal[:]\s+(\d+[,]\d{2})
  date: Invoicedate\s.?\s+(\d{2}-\d{2}-\d{4})\s+\d{2}[:]\d{2}
  invoice_number: Invoicenumber[:]\s+(\S+)
  iban:
    parser: static
    value: NL44INGB0702593702
  partner_coc:
    parser: regex
    regex: '33166113'
  partner_website:
    parser: regex
    regex: mekro.nl
## new test here
  lines:
    parser: lines
    rules:
    - start: Barcode
      line: (?P<line_note>(---FOOD ITEMS---))
      end: Netto totaal
    - start: Barcode
      line: (?P<line_note>(---OTHER ITEMS---))
      end: Netto totaal
    - start: Barcode
      line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
      end: Netto totaal
keywords:
  - Mekro
  - NL001799434B01
options:
  date_formats:
    - '%d %m %Y'
  currency: EUR
  languages:
    - en
  decimal_separator: ','

Result:

[
    {
        "issuer": "Mekro",
        "amount": 15.5,
        "amount_untaxed": 5.5,
        "date": "2021-01-11",
        "invoice_number": "0/0(057)0004/041503",
        "iban": "NL44INGB0702593702",
        "partner_coc": "33166113",
        "partner_website": "mekro.nl",
        "lines": [
            {
                "line_note": "---FOOD ITEMS---"
            },
            {
                "line_note": "---OTHER ITEMS---"
            },
            {
                "barcode": "2231012001992",
                "name": "KROKETBROODJES",
                "qty": "2",
                "uom": "KG",
                "price_unit": "1,00",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019455",
                "name": "Oil",
                "qty": "3",
                "uom": "L",
                "price_unit": "0,50",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019475",
                "name": "Apple",
                "qty": "1",
                "uom": "KG",
                "price_unit": "50,0",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019375",
                "name": "programmer",
                "qty": "1",
                "uom": "Hour",
                "price_unit": "50,0",
                "discount": "100,0"
            },
            {
                "barcode": "0013009019475",
                "name": "Sticker",
                "qty": "1",
                "uom": "pce",
                "price_unit": "0,0",
                "discount": "0,0"
            }
        ],
        "currency": "EUR",
        "desc": "Invoice from Mekro"
    }
]

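Conclusion: the lines output is in the wrong order. The two line_note entries come first, followed by all the item lines, instead of following the order on the invoice. @rmilecki Is it possible to achieve the same result with this code? Am I doing something wrong?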
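
One plausible way to picture the ordering seen above is a parser that runs each rule over the whole block in turn and appends its matches. The sketch below is only an illustration under that assumption; `parse_rule_by_rule` is a hypothetical helper, the input is a shortened excerpt of the Mekro text, and the regexes are simplified.

```python
import re

# Shortened excerpt of the invoice text above (columns simplified).
TEXT = """\
Barcode Description qty
---FOOD ITEMS---
2231012001992 KROKETBROODJES 2
---OTHER ITEMS---
8713009019375 programmer 1
Netto totaal: 5,50
"""

RULES = [
    {"start": "Barcode", "line": r"(?P<line_note>---\w+ ITEMS---)", "end": "Netto totaal"},
    {"start": "Barcode", "line": r"(?P<barcode>\d{13})\s+(?P<name>\w+)\s+(?P<qty>\d+)", "end": "Netto totaal"},
]


def parse_rule_by_rule(text, rules):
    """Hypothetical helper: apply every rule to the whole block in turn."""
    results = []
    for rule in rules:
        start = re.search(rule["start"], text)
        if not start:
            continue
        rest = text[start.end():]
        end = re.search(rule["end"], rest)
        block = rest[:end.start()] if end else rest
        # All matches of this rule are appended before the next rule runs,
        # so the output is grouped by rule, not by position on the invoice.
        for raw in block.splitlines():
            match = re.search(rule["line"], raw)
            if match:
                results.append(match.groupdict())
    return results


rows = parse_rule_by_rule(TEXT, RULES)
for row in rows:
    print(row)
```

In this sketch both notes are emitted before any item line, which mirrors the grouping in the result above.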

bosd avatar Sep 25 '22 11:09 bosd

For completeness,

Here is the desired outcome of the test:

[
    {
        "issuer": "Mekro",
        "amount": 15.5,
        "amount_untaxed": 5.5,
        "date": "2021-01-11",
        "invoice_number": "0/0(057)0004/041503",
        "iban": "NL44INGB0702593702",
        "partner_coc": "33166113",
        "partner_website": "mekro.nl",
        "currency": "EUR",
        "lines": [
            {
                "line_note": "---FOOD ITEMS---"
            },
            {
                "barcode": "2231012001992",
                "name": "KROKETBROODJES",
                "qty": "2",
                "uom": "KG",
                "price_unit": "1,00",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019455",
                "name": "Oil",
                "qty": "3",
                "uom": "L",
                "price_unit": "0,50",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019475",
                "name": "Apple",
                "qty": "1",
                "uom": "KG",
                "price_unit": "50,0",
                "discount": "0,0"
            },
            {
                "line_note": "---OTHER ITEMS---"
            },
            {
                "barcode": "8713009019375",
                "name": "programmer",
                "qty": "1",
                "uom": "Hour",
                "price_unit": "50,0",
                "discount": "100,0"
            },
            {
                "barcode": "0013009019475",
                "name": "Sticker",
                "qty": "1",
                "uom": "pce",
                "price_unit": "0,0",
                "discount": "0,0"
            }
        ],
        "desc": "Invoice from Mekro"
    }
]

bosd avatar Sep 25 '22 12:09 bosd

@rmilecki What should we do with this PR / functionality?

bosd avatar Oct 22 '22 09:10 bosd

I need to rework this: describe it better, provide a use case, add a test, and probably avoid modifying the Amazon YAML, as there is no strong reason for it.

Converted into a draft for now.

I think we can focus on https://github.com/invoice-x/invoice2data/pull/423 in the meantime.

rmilecki avatar Oct 22 '22 14:10 rmilecki

@rmilecki No worries, you'll have some time for this. I just want to let you know that I really want this feature.

Now that we've merged #417, I'm adapting real invoices and a template from Coolblue, which we can add as an example. Sadly, I have to conclude that #417 is still no real alternative to #378, as it does not allow parsing multiple blocks and multiple line definitions. Or maybe I just don't know the correct syntax :)

bosd avatar Oct 22 '22 15:10 bosd

> To start off: I'm all in, for cleaner and easier to understand code. However, I think this PR does not achieve the same thing.

That particular case ended up being discussed in https://github.com/invoice-x/invoice2data/pull/428. It seems we can already support such invoices with the current code. There may be more than one way of handling such complex lines, depending on the expected output.

As for the changes from this pull request, I should rewrite them and add a custom test. I'll open another pull request for that when I have it ready.

rmilecki avatar Feb 03 '23 23:02 rmilecki

One more update: since #417, Mekro invoices can be parsed the way @bosd expected. It can be done with something like:

  lines:
    parser: lines
    start: Barcode
    line:
      - (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
      - ---(?P<line_note>.*ITEMS)---
    end: Netto totaal

The template fragment above parses the invoice provided by @bosd into:

        "lines": [
            {
                "line_note": "FOOD ITEMS"
            },
            {
                "barcode": "2231012001992",
                "name": "KROKETBROODJES",
                "qty": "2",
                "uom": "KG",
                "price_unit": "1,00",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019455",
                "name": "Oil",
                "qty": "3",
                "uom": "L",
                "price_unit": "0,50",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019475",
                "name": "Apple",
                "qty": "1",
                "uom": "KG",
                "price_unit": "50,0",
                "discount": "0,0"
            },
            {
                "line_note": "OTHER ITEMS"
            },
            {
                "barcode": "8713009019375",
                "name": "programmer",
                "qty": "1",
                "uom": "Hour",
                "price_unit": "50,0",
                "discount": "100,0"
            },
            {
                "barcode": "0013009019475",
                "name": "Sticker",
                "qty": "1",
                "uom": "pce",
                "price_unit": "0,0",
                "discount": "0,0"
            }
        ]

(which seems to match what was expected).
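
To illustrate why a list of "line" regexes within one start/end block can keep the invoice order, here is a minimal sketch that walks the block once and tries each pattern on every text line. It assumes first-match-wins per line; `parse_in_document_order` is a hypothetical helper, not the invoice2data implementation, and the input is a shortened excerpt with simplified regexes.

```python
import re

# Shortened excerpt of the Mekro invoice text (columns simplified).
TEXT = """\
Barcode Description qty
---FOOD ITEMS---
2231012001992 KROKETBROODJES 2
---OTHER ITEMS---
8713009019375 programmer 1
Netto totaal: 5,50
"""

LINE_PATTERNS = [
    r"(?P<barcode>\d{13})\s+(?P<name>\w+)\s+(?P<qty>\d+)",
    r"---(?P<line_note>.*ITEMS)---",
]


def parse_in_document_order(text, start, end, line_patterns):
    """Hypothetical helper: one pass over the block, several line regexes."""
    begin = re.search(start, text)
    if not begin:
        return []
    rest = text[begin.end():]
    finish = re.search(end, rest)
    block = rest[:finish.start()] if finish else rest
    results = []
    for raw in block.splitlines():
        # Try every pattern on the current text line; the first match wins,
        # so results are appended in the order the lines appear.
        for pattern in line_patterns:
            match = re.search(pattern, raw)
            if match:
                results.append(match.groupdict())
                break
    return results


rows = parse_in_document_order(TEXT, "Barcode", "Netto totaal", LINE_PATTERNS)
for row in rows:
    print(row)
```

Because the outer loop is over text lines rather than over patterns, each section note stays directly above the items that follow it.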


As for Coolblue invoices, those are trickier; it's hard even to agree on the ideal expected output. That is being discussed in #428.

rmilecki avatar Feb 18 '23 21:02 rmilecki