invoice2data
Lines: support multiple RegEx-es in lines entries
So far "first_line", "line" and "last_line" could each contain only a single
RegEx. Some invoices have lines that use more than one format. To
simplify parsing them, allow all 3 entries to contain a list of RegEx-es.
Example:
fields:
  lines:
    parser: lines
    start: Item\s+Discount\s+Price$
    end: \s+Total
    line:
      - Items group:\s+(?P<group>.+)
      - (?P<description>.+)\s+(?P<discount>\d+.\d+)\s+(?P<price>\d+\d+)
This is my alternative to https://github.com/invoice-x/invoice2data/pull/378 that should really work this time.
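To illustrate the idea of a "line" entry holding a list of RegEx-es, here is a minimal Python sketch. It is not the code from this pull request: the helper name match_any, the sample invoice lines, and the slightly tightened patterns (escaped dots, a trailing $) are all assumptions made for the illustration.

```python
import re

# Hypothetical helper (not part of invoice2data): try each regex in order
# and return the named groups of the first one that matches the line.
def match_any(patterns, line):
    for pattern in patterns:
        m = re.search(pattern, line)
        if m:
            return m.groupdict()
    return None

# Two line formats, in the spirit of the template example above.
line_patterns = [
    r"Items group:\s+(?P<group>.+)",
    r"(?P<description>.+?)\s+(?P<discount>\d+\.\d+)\s+(?P<price>\d+\.\d+)$",
]

# Made-up invoice body lines, used only for this sketch.
sample = [
    "Items group:  Hardware",
    "Keyboard        0.00    25.00",
    "Mouse           5.00    10.00",
]

parsed = [match_any(line_patterns, text_line) for text_line in sample]
for entry in parsed:
    print(entry)
```

Each line yields a dictionary whose keys depend on which pattern matched, which is why the parsed "lines" list can mix group headers and item rows.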
@bosd: I believe you can parse your Mekro.pdf with:
# -*- coding: utf-8 -*-
issuer: Mekro
fields:
  amount: To Pay\s+(\d+.\d{2})
  amount_untaxed: Netto totaal[:]\s+(\d+[,]\d{2})
  date: Invoicedate\s.?\s+(\d{2}-\d{2}-\d{4})\s+\d{2}[:]\d{2}
  invoice_number: Invoicenumber[:]\s+(\S+)
  iban:
    parser: static
    value: NL44INGB0702593702
  partner_coc:
    parser: regex
    regex: '33166113'
  partner_website:
    parser: regex
    regex: mekro.nl
  lines:
    parser: lines
    start: Barcode
    line:
      - (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
      - ---(?P<line_note>.*ITEMS)---
    end: Netto totaal
keywords:
  - Mekro
  - NL001799434B01
options:
  date_formats:
    - '%d %m %Y'
  currency: EUR
  languages:
    - en
  decimal_separator: ','
that template gives me:
"lines": [
{
"line_note": "FOOD ITEMS"
},
{
"barcode": "2231012001992",
"name": "KROKETBROODJES",
"qty": "2",
"uom": "KG",
"price_unit": "1,00",
"discount": "0,0"
},
{
"barcode": "8713009019455",
"name": "Oil",
"qty": "3",
"uom": "L",
"price_unit": "0,50",
"discount": "0,0"
},
{
"barcode": "8713009019475",
"name": "Apple",
"qty": "1",
"uom": "KG",
"price_unit": "50,0",
"discount": "0,0"
},
{
"line_note": "OTHER ITEMS"
},
{
"barcode": "8713009019375",
"name": "programmer",
"qty": "1",
"uom": "Hour",
"price_unit": "50,0",
"discount": "100,0"
},
{
"barcode": "0013009019475",
"name": "Sticker",
"qty": "1",
"uom": "pce",
"price_unit": "0,0",
"discount": "0,0"
}
]
which I believe is what you expected.
@rmilecki Thanks for your efforts to produce an alternative with cleaner code.
Running the first test on the QualityHosting example file. Purpose: to check whether the concatenate function is working.
This is the result:
[
{
"issuer": "QualityHosting AG",
"amount": 34.73,
"amount_untaxed": 34.73,
"date": "2014-05-07",
"invoice_number": "30064443",
"vat": "DE 232 446 240",
"currency": "EUR",
"lines": [
{
"pos": "1",
"qty": 1.0,
"desc": "Small Business StandardExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_strukan\n01.05.14-31.05.14",
"price": 3.89
},
{
"pos": "2",
"qty": 1.0,
"desc": "Small Business QualityExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_schneider\n01.05.14-31.05.14",
"price": 5.39
},
{
"pos": "3",
"qty": 1.0,
"desc": "Small Business QualityExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_minar\n01.05.14-31.05.14",
"price": 5.39
},
{
"pos": "4",
"qty": 1.0,
"desc": "Small Business QualityExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_mayr\n01.05.14-31.05.14",
"price": 5.39
},
{
"pos": "5",
"qty": 1.0,
"desc": "Small Business QualityExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_jenewein\n01.05.14-31.05.14",
"price": 5.39
},
{
"pos": "6",
"qty": 1.0,
"desc": "Small Business QualityExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_jauernik\n01.05.14-31.05.14\nQualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen\niViveLabs Ltd.\n93B Sai Yu Chung\nYuen Long, N.T.\nHong Kong\nPos. Menge Beschreibung Rabatt % VK-Preis Zeilenbetrag\nOhne Ohne MwSt.\nMwSt.",
"price": 5.39
},
{
"pos": "7",
"qty": 1.0,
"desc": "Small Business StandardExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n",
"price": 3.89
}
],
"desc": "Invoice from QualityHosting AG"
}
]
Note: it concatenates correctly.
But something odd is going on in the last line of the first page. It (incorrectly) adds the footer content to the line:
"pos": "6",
"qty": 1.0,
"desc": "Small Business QualityExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_jauernik\n01.05.14-31.05.14\nQualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen\niViveLabs Ltd.\n93B Sai Yu Chung\nYuen Long, N.T.\nHong Kong\nPos. Menge Beschreibung Rabatt % VK-Preis Zeilenbetrag\nOhne Ohne MwSt.\nMwSt.",
"price": 5.39
},
The output from pdftotext for this part is:
Grundgebühr pro Einheit
Dienst: OUDJQ_jauernik
01.05.14-31.05.14
QualityHosting AG Vorstand: Christian Heit (Vorsitz), Bankverbindung
Uferweg 40-42 Markus Oestreicher Kreissparkasse Gelnhausen
Note the line:
QualityHosting AG Vorstand: Christian Heit (Vorsitz),
It does not contain a whitespace character at the beginning of the line. According to the invoice template, it should not match:
line: '^\s+(?P<desc>.+)$'
But it does.
@bosd: can you paste or send me the full pdftotext output, please? Feel free to replace your private data there with random text. That will make it much, much easier for me to debug that problem.
Here it is: (No personal info there, this is one of the example files included in this library)
DEBUG:invoice2data.main:START pdftotext result ===========================
DEBUG:invoice2data.main:
QualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen
iViveLabs Ltd.
93B Sai Yu Chung
Yuen Long, N.T.
Hong Kong
Rechnung Seite 1
Rechnungsnr. 30064443 Kundennr. 47774
Rechnungsdatum 7. Mai 2014
Pos. Menge Beschreibung Rabatt % VK-Preis Zeilenbetrag
Ohne Ohne MwSt.
MwSt.
Contract No. CON02858
1 1 Small Business StandardExchange 2010 3,89 3,89
Grundgebühr pro Einheit
Dienst: OUDJQ_strukan
01.05.14-31.05.14
2 1 Small Business QualityExchange 2010 5,39 5,39
Grundgebühr pro Einheit
Dienst: OUDJQ_schneider
01.05.14-31.05.14
3 1 Small Business QualityExchange 2010 5,39 5,39
Grundgebühr pro Einheit
Dienst: OUDJQ_minar
01.05.14-31.05.14
4 1 Small Business QualityExchange 2010 5,39 5,39
Grundgebühr pro Einheit
Dienst: OUDJQ_mayr
01.05.14-31.05.14
5 1 Small Business QualityExchange 2010 5,39 5,39
Grundgebühr pro Einheit
Dienst: OUDJQ_jenewein
01.05.14-31.05.14
6 1 Small Business QualityExchange 2010 5,39 5,39
Grundgebühr pro Einheit
Dienst: OUDJQ_jauernik
01.05.14-31.05.14
QualityHosting AG Vorstand: Christian Heit (Vorsitz), Bankverbindung
Uferweg 40-42 Markus Oestreicher Kreissparkasse Gelnhausen
D-63571 Gelnhausen Aufsichtsrat: Hans Jürgen Kto-Nr. 48567
Tel. +49 6051 916 44 10 Habermann (Vorsitz) Blz: 507 500 94
Fax +49 6051 916 44 29 Registergericht Hanau | HRB 13302 IBAN DE30507500940000048567
Im Internet: www.qualityhosting.de UStId DE 232 446 240 SWIFT HELADEF1GEL
eMail: [email protected] Steuer-Nr. 044 241 601 03
QualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen
iViveLabs Ltd.
93B Sai Yu Chung
Yuen Long, N.T.
Hong Kong
Rechnung Seite 2
Rechnungsnr. 30064443 Kundennr. 47774
Rechnungsdatum 7. Mai 2014
Pos. Menge Beschreibung Rabatt % VK-Preis Zeilenbetrag
Ohne Ohne MwSt.
MwSt.
7 1 Small Business StandardExchange 2010 3,89 3,89
Grundgebühr pro Einheit
Dienst: OUDJQ_office
01.05.14-31.05.14
Total EUR 34,73
Zahlungsform Banküberweisung
Zahlungsbedingungen 14 Tage netto
Zahlungsziel 21.05.14
Für Rückfragen bzgl. dieser Rechnung wenden SIe sich bitte per E-Mail unter der Angabe Ihrer Kunden- und Rechnungsnummer an
[email protected].
Einwände gegen die Ihnen berechneten Lieferungen und Leistungen sind schriftlich innerhalb 4 Wochen ab Rechnungsdatum unserer
Buchhaltung anzuzeigen. Nach Ablauf dieser Frist gelten die Beträge als genehmigt. Im Falle einer Rücklastschrift der Beträge ohne
Verschulden der QualityHosting AG berechnen wir für die uns entstandenen Kosten ein Entgeld von 15,00 EUR. Unabhängig davon
behalten wir uns die Einstellung unserer Leistungen bis zum Ausgleich unserer Forderungen ausdrücklich vor.
QualityHosting AG Vorstand: Christian Heit (Vorsitz), Bankverbindung
Uferweg 40-42 Markus Oestreicher Kreissparkasse Gelnhausen
D-63571 Gelnhausen Aufsichtsrat: Hans Jürgen Kto-Nr. 48567
Tel. +49 6051 916 44 10 Habermann (Vorsitz) Blz: 507 500 94
Fax +49 6051 916 44 29 Registergericht Hanau | HRB 13302 IBAN DE30507500940000048567
Im Internet: www.qualityhosting.de UStId DE 232 446 240 SWIFT HELADEF1GEL
eMail: [email protected] Steuer-Nr. 044 241 601 03
DEBUG:invoice2data.main:END pdftotext result =============================
@bosd: OK, I just found it's about QualityHosting.pdf.
That problem you reported, about parsing lines in QualityHosting.pdf, is caused by the way src/invoice2data/extract/templates/de/de.qualityhosting.yml is constructed. It has nothing to do with changes from this pull request.
- de.qualityhosting.yml parses lines incorrectly without this pull request changes
- de.qualityhosting.yml parses lines incorrectly with those changes
So while I agree de.qualityhosting.yml needs to be fixed, it really has nothing to do with this pull request. I'm happy to help you fix de.qualityhosting.yml if you need to parse QualityHosting invoices. It shouldn't be considered a blocker for this pull request, however.
@rmilecki
- de.qualityhosting.yml parses lines incorrectly without this pull request changes
I agree with you on this one.
de.qualityhosting.yml parses lines incorrectly with those changes
I think that is something that needs to be fixed, as in #378, of which this PR is an alternative implementation.
I'll respectfully disagree with you on changing the invoice template.
Yes, the invoice template is not optimal.
(It's a different topic, but performance-wise the . character (meta escape) should be avoided in Python regexes.)
In this case, it is possible to add a last line rule, as of your proposal in #422.
However, I found many practical cases where it is impossible to define a last_line in the template.
In that case, all non-matching lines (like footers) should be discarded until a new first_line match is found.
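One possible reading of that proposal, sketched in Python. This is not how invoice2data is implemented: the function parse_lines, the simplified regexes, and the sample text (loosely modelled on the QualityHosting output, with made-up indentation) are assumptions for illustration only. Here a non-matching line closes the current item, and everything after it is dropped until first_line matches again.

```python
import re

def parse_lines(text, first_line, line):
    """Hypothetical sketch: a non-matching line ends the current item."""
    items = []
    current = None
    for raw in text.splitlines():
        m = re.search(first_line, raw)
        if m:
            # A new item starts here.
            current = m.groupdict()
            items.append(current)
            continue
        # Continuation lines only count while an item is open.
        m = re.search(line, raw) if current is not None else None
        if m:
            for key, value in m.groupdict().items():
                current[key] = f"{current[key]}\n{value}" if key in current else value
        else:
            # Footer or other noise: discard until the next first_line match.
            current = None
    return items

first_line = r"^\s+(?P<pos>\d+)\s+(?P<qty>\d+)\s+(?P<desc>.+?)\s+(?P<price>\d+,\d+)\s+\d+,\d+$"
line = r"^\s+(?P<desc>.+)$"

text = "\n".join([
    "  6 1 Small Business QualityExchange 2010 5,39 5,39",
    "      Dienst: OUDJQ_jauernik",
    "QualityHosting AG Vorstand: Christian Heit (Vorsitz), Bankverbindung",
    "  iViveLabs Ltd.",
    "  7 1 Small Business StandardExchange 2010 3,89 3,89",
    "      Dienst: OUDJQ_office",
])

items = parse_lines(text, first_line, line)
for item in items:
    print(item)
```

With this behaviour the indented address line after the footer is not appended to item 6, because the unindented footer line has already closed that item.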
Maybe it is better to leave suboptimal tests and examples in this library, just as a showcase. (The same goes for the OCR examples in this repo.) They are definitely helping us find these corner cases.
However, template set aside, here is my analysis of what's happening. Instead of passing one line at a time to the regex pattern match, it looks like it is trying to match the regex across the whole content. (I did not dive into the code yet to verify this.)
The regex is as follows:
\s+
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
+ matches the previous token between one and unlimited times, as many times as possible.
When looking at the line:
QualityHosting AG Vorstand: Christian Heit (Vorsitz), Bankverbindung
It does not contain a whitespace character at the beginning of the line. So it should not be used in the output. (This is the implementation of #378).
However, if we feed the following into the parser:
QualityHosting AG Vorstand: Christian Heit (Vorsitz), Bankverbindung
A line break \n has been added on the line above our footer. Now the regex
\s+(?P<desc>.+)
matches on the first line of the footer. It matches because it consumes the line break from the previous line.
This should not happen, as the line break is clearly on another line.
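The regex behaviour described here can be reproduced in isolation with Python's re module. The snippet below only shows that \s also matches \n; it does not show what the invoice2data parser actually passes to the regex, which is exactly the point left unverified above. The variable names are made up for the sketch.

```python
import re

footer = "QualityHosting AG Vorstand: Christian Heit (Vorsitz), Bankverbindung"
anchored = re.compile(r"^\s+(?P<desc>.+)$")

# Matching the footer line on its own: no leading whitespace, so no match.
per_line = anchored.match(footer)

# Matching a string that still contains the preceding line break:
# \s+ consumes the "\n", and .+ then captures the footer text.
with_newline = anchored.match("\n" + footer)

# Splitting into lines first avoids the problem: neither the empty line
# nor the footer line matches.
split_matches = [anchored.match(part) for part in ("\n" + footer).splitlines()]

print(per_line)
print(with_newline.group("desc"))
print(split_matches)
```

So the outcome depends on whether the pattern sees one line at a time or a chunk of text that still includes line breaks.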
@bosd: in your analysis of how \s+ is treated, I think you misread the input invoice. That line: '^\s+(?P<desc>.+)$' in de.qualityhosting.yml works as you expected. It matches only those lines that start with whitespace.
When looking at the line:
QualityHosting AG Vorstand: Christian Heit (Vorsitz), Bankverbindung
It does not contain a whitespace character at the beginning of the line. So it should not be used in the output.
I agree with you. I believe you are correct. It should not be used in the output and it isn't used in the output.
So things work just like you expect them.
Please check this pdftotext output with my comments (scroll RIGHT please!):
6 1 Small Business QualityExchange 2010 5,39 5,39
Grundgebühr pro Einheit
Dienst: OUDJQ_jauernik
01.05.14-31.05.14
QualityHosting AG Vorstand: Christian Heit (Vorsitz), Bankverbindung ← doesn't start with whitespace = doesn't appear in the output
Uferweg 40-42 Markus Oestreicher Kreissparkasse Gelnhausen ← doesn't start with whitespace = doesn't appear in the output
D-63571 Gelnhausen Aufsichtsrat: Hans Jürgen Kto-Nr. 48567 ← doesn't start with whitespace = doesn't appear in the output
Tel. +49 6051 916 44 10 Habermann (Vorsitz) Blz: 507 500 94 ← doesn't start with whitespace = doesn't appear in the output
Fax +49 6051 916 44 29 Registergericht Hanau | HRB 13302 IBAN DE30507500940000048567 ← doesn't start with whitespace = doesn't appear in the output
Im Internet: www.qualityhosting.de UStId DE 232 446 240 SWIFT HELADEF1GEL ← doesn't start with whitespace = doesn't appear in the output
eMail: [email protected] Steuer-Nr. 044 241 601 03
QualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen ← starts with whitespace = appears in the output = expected
iViveLabs Ltd. ← starts with whitespace = appears in the output = expected
93B Sai Yu Chung ← starts with whitespace = appears in the output = expected
Yuen Long, N.T. ← starts with whitespace = appears in the output = expected
Hong Kong ← starts with whitespace = appears in the output = expected
Rechnung Seite 2 ← doesn't start with whitespace = doesn't appear in the output
Rechnungsnr. 30064443 Kundennr. 47774 ← doesn't start with whitespace = doesn't appear in the output
Rechnungsdatum 7. Mai 2014
Pos. Menge Beschreibung Rabatt % VK-Preis Zeilenbetrag
Ohne Ohne MwSt.
MwSt.
7 1 Small Business StandardExchange 2010 3,89 3,89
Grundgebühr pro Einheit
Dienst: OUDJQ_office
01.05.14-31.05.14
So I think everything works correctly, just as you described you expect it to.
@rmilecki Thanks for the very clear information. It makes sense to change the template to only include a line when there are at least X spaces at the beginning of the line.
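A small Python sketch of such a "minimum indentation" rule. The threshold of 8 spaces and the sample lines are assumptions: the real indentation widths in the pdftotext output are not visible in this thread, so the right value of X would have to be checked against the actual text.

```python
import re

def indented_line_regex(min_spaces):
    """Build a regex that only accepts lines indented by at least min_spaces spaces."""
    return re.compile(r"^ {%d,}(?P<desc>\S.*)$" % min_spaces)

pattern = indented_line_regex(8)

# Made-up lines: a deeply indented description and a shallowly indented header.
description = " " * 12 + "Grundgebühr pro Einheit"
page_header = " " * 2 + "iViveLabs Ltd."

desc_match = pattern.match(description)
header_match = pattern.match(page_header)

print(desc_match.group("desc"))
print(header_match)
```

In a template the same idea could be written as a quantifier on the leading whitespace of the line regex, for example \s{8,} in place of \s+, again with 8 standing in for whatever X turns out to be.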
@bosd: so could we have this one merged now, please?
It's a clear implementation, solves the actual problem you reported, and doesn't seem to regress anything. I find it a nice feature.
It's not meant to solve all the cases our templates can't handle now. But it does solve one, and I believe it's worth having.
We can work on handling more cases in further changes (e.g. #407, #423) but we need to start moving forward with something. I'm happy to discuss and work together on other cases later. At the same time I'd like to start merging proposed features.
@rmilecki I want to move this one forward as well, as I really want to have this functionality. Yet, first I would like to test it a bit more thoroughly. It is quite a big change to test (I will use real use-case templates and invoices). I hope I can do that this weekend and give the approval then.
@rmilecki Thanks for collaborating on this and all your efforts! Let's Merge!! :tada: :sparkles:
Tested this code against a bunch of pdf's and templates locally with great success!!!!
Some notes:
- The parsing of different "blocks" is not working with the following syntax.
lines:
  - start: Efficient Invoice Handling
    end: 5555 NH TommyCity
    line: (?P<test>(Sessamestreet 46))
  - start: Barcode
    end: Netto totaal
    line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
I tried a variant with the new syntax, but could not make it work. What would be the right syntax to parse multiple (different) blocks of lines? Since this is related to #423, which is about the parsing of multiple similar blocks, I would not consider it to block this PR.
- The documentation / examples need updating to show the correct syntax to parse multiple lines, e.g.:
line:
  - (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
  - ---(?P<line_note>.*ITEMS)---
(I will take care of no. 2, as it is in my pipeline to provide a new real invoice.)