parsers: lines: support "rules" for multiple sets of regexes
templates: make Amazon YAML use "rules" for "lines"
This doesn't really change anything and isn't strictly necessary. It does, however,
allow testing the "rules" implementation in the "lines" parser.
lines: support "rules": field for multiple sets of parsing regexes
Sometimes companies use more than 1 format for line-parseable data. They
may e.g. randomly add some extra columns that are used occasionally.
This commit adds "rules" field support to the "lines" parser. It allows
defining multiple sets of regexes ("start", "end", "line" & friends) for
a single field.
Usage of "rules" is optional. Backward compatibility with existing
templates is preserved.
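A rough illustration of the backward compatibility described above: a definition without "rules" can be treated as a single implicit rule, so both template styles go through the same code path. This is only a sketch; `normalize_rules` is a hypothetical helper, not the code from this commit, and it assumes the field definition arrives as a plain dict.

```python
def normalize_rules(settings):
    """Return the list of rule dicts for a "lines" field definition.

    Hypothetical helper for illustration only (not part of invoice2data).
    """
    if "rules" in settings:
        # New style: the template already lists its sets of regexes.
        return list(settings["rules"])
    # Old style: everything except the bookkeeping keys forms one implicit rule.
    return [{k: v for k, v in settings.items() if k not in ("parser", "rules")}]


legacy = {"parser": "lines", "start": "Barcode", "line": r"\d{13}", "end": "Netto totaal"}
print(normalize_rules(legacy))
```

With this shape, the rest of the parser only ever needs to loop over a list of rules, whether or not the template uses the new field.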
This is an alternative implementation of the feature suggested in https://github.com/invoice-x/invoice2data/issues/377
This should resolve: https://github.com/invoice-x/invoice2data/issues/238 https://github.com/invoice-x/invoice2data/issues/377
@bosd: I think this implementation is a bit simpler than the one suggested in https://github.com/invoice-x/invoice2data/issues/377. I hope my changes are reasonably simple to understand and review.
One advantage I see in this approach is that the lines parser focuses on the parser-based syntax (the new syntax).
If we want to support new features in the old syntax, I believe that code should go into the lines plugin code.
To start off: I'm all in for cleaner, easier-to-understand code. However, I think this PR does not achieve the same thing.
edit: added txt input for easier testing
Mekro B.V. Pagina 1
Moermanstraat 4 Invoicedate : 01-11-2021 16:34
Info: Mekro Service Center 5222 BD 's-Hertogenbosch Printdatum : 01-11-2021
16:35
Telefoon: 0900-2025000
Postbus 159, 2290 AD Wateringen
www.mekro.nl
K.v.K: 33166113 OB nr: NL001799434B01
IBAN: NL44 INGB 0702 5937 02
BIC : INGBNL2A
Invoicenumber: 0/0(057)0004/041503 (004-374170) 057/119
----------------------------------------------------------------------------------------------------
Efficient Invoice Handling Klantnummer : 057 726666 01 01 SC
Sessamestreet 46
5555 NH TommyCity
----------------------------------------------------------------------------------------------------
Stuks per Prijs per Code Prijs st/kg
Barcode Description qty uom unitprice discount
----------------------------------------------------------------------------------------------------
---FOOD ITEMS---
2231012001992 KROKETBROODJES 2 KG 1,00 0,0%
8713009019455 Oil 3 L 0,50 0,0%
8713009019475 Apple 1 KG 50,0 0,0%
---OTHER ITEMS---
8713009019375 programmer 1 Hour 50,0 100,0%
0013009019475 Sticker 1 pce 0,0 0,0%
Aantal stuks: 7 Netto totaal: 5,50
Excl.BTW Code BTW BTW Totaal
0 1=21,00% 0,10 0.00
0 5= 9,00% 0,0 0,00
------------------------------------------------
53,85 6,05 59,90
----------------------------------------------------------------------------------------------------
To Pay 15,50
POI: 52001324 KLANTTICKET --------------------------------
Terminal: BS111850 Merchant: 9533494654 Period: 1305
Transactie: 00000055 Token: 2004130501564440011 AMERICAN EXPRESS
(A000000022010801) Kaart: 375382xxxxx1000 Kaartserienummer: 0
BETALING Datum: 01/11/2021 16:36 Autorisatiecode: 66
Visit www.americanexpress.nl Total: 5,50 EUR Contact
Leesmethode: CHIP Met PIN gevalideerd
Pin betaling 5,50
------------------------------------------------
Paid 5,50
Test with taxes; I changed the "To Pay" amount.
Repeating the functional test from https://github.com/invoice-x/invoice2data/pull/378#issuecomment-1168534755
I rewrote the template using the syntax of this PR:
# -*- coding: utf-8 -*-
issuer: Mekro
fields:
  amount: To Pay\s+(\d+.\d{2})
  amount_untaxed: Netto totaal[:]\s+(\d+[,]\d{2})
  date: Invoicedate\s.?\s+(\d{2}-\d{2}-\d{4})\s+\d{2}[:]\d{2}
  invoice_number: Invoicenumber[:]\s+(\S+)
  iban:
    parser: static
    value: NL44INGB0702593702
  partner_coc:
    parser: regex
    regex: '33166113'
  partner_website:
    parser: regex
    regex: mekro.nl
  ## new test here
  lines:
    parser: lines
    rules:
      - start: Barcode
        line: (?P<line_note>(---FOOD ITEMS---))
        end: Netto totaal
      - start: Barcode
        line: (?P<line_note>(---OTHER ITEMS---))
        end: Netto totaal
      - start: Barcode
        line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
        end: Netto totaal
keywords:
  - Mekro
  - NL001799434B01
options:
  date_formats:
    - '%d %m %Y'
  currency: EUR
  languages:
    - en
  decimal_separator: ','
Result:
[
{
"issuer": "Mekro",
"amount": 15.5,
"amount_untaxed": 5.5,
"date": "2021-01-11",
"invoice_number": "0/0(057)0004/041503",
"iban": "NL44INGB0702593702",
"partner_coc": "33166113",
"partner_website": "mekro.nl",
"lines": [
{
"line_note": "---FOOD ITEMS---"
},
{
"line_note": "---OTHER ITEMS---"
},
{
"barcode": "2231012001992",
"name": "KROKETBROODJES",
"qty": "2",
"uom": "KG",
"price_unit": "1,00",
"discount": "0,0"
},
{
"barcode": "8713009019455",
"name": "Oil",
"qty": "3",
"uom": "L",
"price_unit": "0,50",
"discount": "0,0"
},
{
"barcode": "8713009019475",
"name": "Apple",
"qty": "1",
"uom": "KG",
"price_unit": "50,0",
"discount": "0,0"
},
{
"barcode": "8713009019375",
"name": "programmer",
"qty": "1",
"uom": "Hour",
"price_unit": "50,0",
"discount": "100,0"
},
{
"barcode": "0013009019475",
"name": "Sticker",
"qty": "1",
"uom": "pce",
"price_unit": "0,0",
"discount": "0,0"
}
],
"currency": "EUR",
"desc": "Invoice from Mekro"
}
]
Conclusion: the lines output is in the wrong order. Both line_note entries come before all the item lines instead of heading their own groups. @rmilecki Is it possible to achieve the same result with this code? Am I doing something wrong?
For completeness, here is the desired outcome of the test:
[
{
"issuer": "Mekro",
"amount": 15.5,
"amount_untaxed": 5.5,
"date": "2021-01-11",
"invoice_number": "0/0(057)0004/041503",
"iban": "NL44INGB0702593702",
"partner_coc": "33166113",
"partner_website": "mekro.nl",
"currency": "EUR",
"lines": [
{
"line_note": "---FOOD ITEMS---"
},
{
"barcode": "2231012001992",
"name": "KROKETBROODJES",
"qty": "2",
"uom": "KG",
"price_unit": "1,00",
"discount": "0,0"
},
{
"barcode": "8713009019455",
"name": "Oil",
"qty": "3",
"uom": "L",
"price_unit": "0,50",
"discount": "0,0"
},
{
"barcode": "8713009019475",
"name": "Apple",
"qty": "1",
"uom": "KG",
"price_unit": "50,0",
"discount": "0,0"
},
{
"line_note": "---OTHER ITEMS---"
},
{
"barcode": "8713009019375",
"name": "programmer",
"qty": "1",
"uom": "Hour",
"price_unit": "50,0",
"discount": "100,0"
},
{
"barcode": "0013009019475",
"name": "Sticker",
"qty": "1",
"uom": "pce",
"price_unit": "0,0",
"discount": "0,0"
}
],
"desc": "Invoice from Mekro"
}
]
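The grouped ordering in the actual result above is what one would get if each rule were applied to the whole block in turn and the matches were concatenated. That is an assumption about the behaviour, not something confirmed against the code in this PR. The function name, the shortened regexes and the trimmed sample block below are made up for illustration.

```python
import re

# Trimmed copy of the invoice block between "Barcode" and "Netto totaal".
BLOCK = [
    "---FOOD ITEMS---",
    "2231012001992 KROKETBROODJES 2 KG 1,00 0,0%",
    "8713009019455 Oil 3 L 0,50 0,0%",
    "---OTHER ITEMS---",
    "8713009019375 programmer 1 Hour 50,0 100,0%",
]

# One (shortened) "line" regex per rule, in the same order as the template.
RULE_REGEXES = [
    r"(?P<line_note>---FOOD ITEMS---)",
    r"(?P<line_note>---OTHER ITEMS---)",
    r"(?P<barcode>\d{13})\s+(?P<name>\w+)",
]


def parse_rule_by_rule(block, rule_regexes):
    """Scan the whole block once per rule and concatenate the matches."""
    parsed = []
    for pattern in rule_regexes:
        for text in block:
            match = re.search(pattern, text)
            if match:
                parsed.append(match.groupdict())
    return parsed


for entry in parse_rule_by_rule(BLOCK, RULE_REGEXES):
    print(entry)
```

Because the outer loop runs over rules rather than over invoice lines, the output follows rule order, so both notes end up ahead of every item.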
@rmilecki What should we do with this PR / functionality?
I need to rework this: describe it better, provide a use case and a test, and probably avoid modifying the Amazon YAML, as there is no strong reason for that.
I've converted it into a draft for now.
Meanwhile, I think we can focus on https://github.com/invoice-x/invoice2data/pull/423
@rmilecki No worries, you'll have some time for this. I just want to let you know that I really want this.
Now that we've merged #417, I'm adapting real invoices and a template from Coolblue, which we can add as an example.
Sadly, I have to conclude that #417 is still no real alternative for #378, as it does not allow parsing multiple blocks and multiple line definitions.
Or maybe I just don't know the correct syntax :)
> To start off: I'm all in for cleaner, easier-to-understand code. However, I think this PR does not achieve the same thing.
That particular case ended up being discussed in https://github.com/invoice-x/invoice2data/pull/428. It seems we can already support such invoices with the current code. There may be more than one way of handling such complex lines, depending on the expected output.
As for the changes from this pull request, I should rewrite them and add a custom test. I'll open another pull request for that when I get it ready.
One more update: Mekro invoices can be parsed the way @bosd expected since #417. It can be done with something like:
lines:
  parser: lines
  start: Barcode
  line:
    - (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
    - ---(?P<line_note>.*ITEMS)---
  end: Netto totaal
The above template fragment parses the invoice provided by @bosd into:
"lines": [
{
"line_note": "FOOD ITEMS"
},
{
"barcode": "2231012001992",
"name": "KROKETBROODJES",
"qty": "2",
"uom": "KG",
"price_unit": "1,00",
"discount": "0,0"
},
{
"barcode": "8713009019455",
"name": "Oil",
"qty": "3",
"uom": "L",
"price_unit": "0,50",
"discount": "0,0"
},
{
"barcode": "8713009019475",
"name": "Apple",
"qty": "1",
"uom": "KG",
"price_unit": "50,0",
"discount": "0,0"
},
{
"line_note": "OTHER ITEMS"
},
{
"barcode": "8713009019375",
"name": "programmer",
"qty": "1",
"uom": "Hour",
"price_unit": "50,0",
"discount": "100,0"
},
{
"barcode": "0013009019475",
"name": "Sticker",
"qty": "1",
"uom": "pce",
"price_unit": "0,0",
"discount": "0,0"
}
]
(which seems to match what was expected).
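For illustration, the effect of a list under `line` can be sketched as trying each regex on every line of the block, top to bottom, and keeping the first match, which preserves document order. This is a simplified stand-in, not invoice2data's actual implementation: `parse_block` is a hypothetical name, the use of `re.search` is an assumption, and the sample lines are copied from the invoice text in this thread.

```python
import re

# The two "line" regexes from the template fragment, unchanged.
LINE_REGEXES = [
    r"(?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)"
    r"\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)",
    r"---(?P<line_note>.*ITEMS)---",
]

# A few lines from between "Barcode" and "Netto totaal" in the sample invoice.
BLOCK = [
    "---FOOD ITEMS---",
    "2231012001992 KROKETBROODJES 2 KG 1,00 0,0%",
    "---OTHER ITEMS---",
    "8713009019375 programmer 1 Hour 50,0 100,0%",
]


def parse_block(block, line_regexes):
    """Walk the block once; for each line keep the first regex that matches."""
    parsed = []
    for text in block:
        for pattern in line_regexes:
            match = re.search(pattern, text)
            if match:
                parsed.append(match.groupdict())
                break  # first matching regex wins for this line
    return parsed


for entry in parse_block(BLOCK, LINE_REGEXES):
    print(entry)
```

Since the outer loop follows the invoice lines, each note stays directly above the items that belong to it.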
As for Coolblue invoices, those are more tricky; it's even hard to agree on the ideal expected output. That is being discussed in #428.