invoice2data

fields are missing in xml export

Open christianroeser opened this issue 4 years ago • 16 comments

Hi,

It seems that most fields are missing from the XML export. Only currency, amount, date, desc and issuer are exported. Invoice number and other fields are shown only as an empty <fieldname> tag.

Bye Christian

christianroeser avatar Aug 13 '20 15:08 christianroeser

Does this still happen even after #291 was merged?

m3nu avatar Aug 14 '20 00:08 m3nu

Yes, the fields are included but empty. They are shown as

xml

christianroeser avatar Aug 14 '20 08:08 christianroeser

I see. Is this intended? @sudeepjd, can you reproduce it? If yes, the last PR should be amended.

m3nu avatar Aug 14 '20 08:08 m3nu

I closed #291 as there was a better PR at #287. I don't think my PR was merged into master. #287 was merged 4 days ago. I forked the repo and merged #287 into my fork because I wanted to use the XML export.

I will recheck though...

sudeepjd avatar Aug 14 '20 08:08 sudeepjd

Right, sorry. @rmilecki, can you see the issue mentioned here?

m3nu avatar Aug 14 '20 09:08 m3nu

Sure!

@christianroeser: I'm a bit confused by "most fields are missing" while your output shows that only invoice_number and kundennr are missing. When testing the latest master branch with my invoices and templates I can't reproduce even that.

Can you please provide the output of

invoice2data --debug --output-format xml <invoice>.pdf
cat invoices-output.xml

and maybe your .tpl file as well?

rmilecki avatar Aug 14 '20 10:08 rmilecki

I wouldn't be surprised if the mistake is mine. First I installed invoice2data via pip and swapped to_xml.py with the current file from master. Now I have uninstalled it and (hopefully) installed master via pip install git+https://github.com/invoice-x/invoice2data.git#egg=invoice2data. I still get the same results.... :(

DEBUG:invoice2data.extract.invoice_template:END optimized_str ==========================
DEBUG:invoice2data.extract.invoice_template:Date parsing: languages=[] date_formats=['%d.%m.%Y']
DEBUG:invoice2data.extract.invoice_template:Float parsing: decimal separator=,
DEBUG:invoice2data.extract.invoice_template:keywords=['ALLNET GmbH Computersysteme', 'DE 128 214 294']
DEBUG:invoice2data.extract.invoice_template:{'date_formats': ['%d.%m.%Y'], 'lowercase': False, 'decimal_separator': ',', 'currency': 'EUR', 'replace': [], 'languages': [], 'remove_whitespace': False, 'remove_accents': False}
DEBUG:invoice2data.extract.invoice_template:field=amount | regexp=Endsumme\s+\EUR\s+([0-9]+\,\d{2})
DEBUG:invoice2data.extract.invoice_template:res_find=[u'93,83']
DEBUG:invoice2data.extract.invoice_template:field=invoice_number | regexp=Nummer\s+([0-9]+)
DEBUG:invoice2data.extract.invoice_template:res_find=[u'1234567', u'1234567', u'1234567']
DEBUG:invoice2data.extract.invoice_template:field=date | regexp=Lieferdatum\s+(\d{2}\.\d{2}\.\d{4})
DEBUG:invoice2data.extract.invoice_template:res_find=[u'21.01.2020', u'21.01.2020', u'21.01.2020']
DEBUG:invoice2data.extract.invoice_template:result of date parsing=2020-01-21 00:00:00
DEBUG:invoice2data.extract.invoice_template:field=kundennr | regexp=Kunden-Nr.\s+(\d{5})
DEBUG:invoice2data.extract.invoice_template:res_find=[u'12345', u'12345', u'12345']
DEBUG:invoice2data.extract.invoice_template:{'currency': 'EUR', 'amount': 93.83, 'date': datetime.datetime(2020, 1, 21, 0, 0), 'invoice_number': u'1234567', 'desc': 'Invoice from ALLNET GmbH Computersysteme', 'kundennr': u'12345', 'issuer': 'ALLNET GmbH Computersysteme'}
INFO:invoice2data.main:{'currency': 'EUR', 'amount': 93.83, 'date': datetime.datetime(2020, 1, 21, 0, 0), 'invoice_number': u'1234567', 'desc': 'Invoice from ALLNET GmbH Computersysteme', 'kundennr': u'12345', 'issuer': 'ALLNET GmbH Computersysteme'}

<?xml version="1.0" ?>
<data>
  <item id="1">
    <currency>EUR</currency>
    <amount>93.83</amount>
    <date>2020-01-21</date>
    <invoice_number/>
    <desc>Invoice from ALLNET GmbH Computersysteme</desc>
    <kundennr/>
    <issuer>ALLNET GmbH Computersysteme</issuer>
  </item>
</data>

issuer: ALLNET GmbH Computersysteme
fields:
  amount: Endsumme\s+\EUR\s+([0-9]+\,\d{2})
  invoice_number: Nummer\s+([0-9]+)
  date: Lieferdatum\s+(\d{2}\.\d{2}\.\d{4})
  kundennr: Kunden-Nr.\s+(\d{5})
keywords:
- ALLNET GmbH Computersysteme
- DE 128 214 294
options:
  decimal_separator: ','
  date_formats:
    - '%d.%m.%Y'

christianroeser avatar Aug 14 '20 15:08 christianroeser

Thanks, I'm wondering if it's caused by your regular expressions returning several matches. I'll do more testing locally and try to fix that!
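
To illustrate what several matches means here: on a multi-page invoice the same label can repeat on every page, so a field regex finds it once per page, as in the res_find lists of the debug log. A minimal sketch using Python's standard re.findall; the sample text is made up and this is not invoice2data's actual extraction code:

```python
import re

# Made-up text standing in for a three-page invoice whose header repeats.
page = "Rechnung Nummer 1234567\nKunden-Nr. 12345\n"
text = page * 3

# Field patterns taken from the template posted in this thread.
fields = {
    "invoice_number": r"Nummer\s+([0-9]+)",
    "kundennr": r"Kunden-Nr.\s+(\d{5})",
}

# One list of captured values per field, one entry per occurrence.
matches = {name: re.findall(pattern, text) for name, pattern in fields.items()}

# Collapse repeated hits to the distinct values, keeping first-seen order.
distinct = {name: list(dict.fromkeys(found)) for name, found in matches.items()}

print(matches)
print(distinct)
```

Each field yields three identical hits on this text, and collapsing them leaves a single value per field.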

rmilecki avatar Aug 17 '20 05:08 rmilecki

@rmilecki: Thank you! I just did a quick test. The invoice was originally a three page invoice, so I processed just the last page. Invoice2data now finds only one match but the result is still the same.

christianroeser avatar Aug 17 '20 08:08 christianroeser

@christianroeser: I couldn't reproduce this problem using any of my invoices or templates, so I decided to fake an ALLNET invoice and use your template.

Your template didn't work for me initially due to the:

re.error: bad escape \E at position 11

so I replaced \EUR with EUR. Then I parsed my faked invoice (attached as allnet.pdf). It seems to work just fine for me.

> invoice2data --template-folder tpl --debug --output-format xml allnet.pdf
DEBUG:invoice2data.extract.invoice_template:END optimized_str ==========================
DEBUG:invoice2data.extract.invoice_template:Date parsing: languages=[] date_formats=['%d.%m.%Y']
DEBUG:invoice2data.extract.invoice_template:Float parsing: decimal separator=,
DEBUG:invoice2data.extract.invoice_template:keywords=['ALLNET GmbH Computersysteme', 'DE 128 214 294']
DEBUG:invoice2data.extract.invoice_template:{'remove_whitespace': False, 'remove_accents': False, 'lowercase': False, 'currency': 'EUR', 'date_formats': ['%d.%m.%Y'], 'languages': [], 'decimal_separator': ',', 'replace': []}
DEBUG:invoice2data.extract.invoice_template:field=amount | regexp=Endsumme\s+EUR\s+([0-9]+\,\d{2})
DEBUG:invoice2data.extract.invoice_template:res_find=['93,83', '93,83', '93,83']
DEBUG:invoice2data.extract.invoice_template:field=invoice_number | regexp=Nummer\s+([0-9]+)
DEBUG:invoice2data.extract.invoice_template:res_find=['1234567', '1234567', '1234567']
DEBUG:invoice2data.extract.invoice_template:field=date | regexp=Lieferdatum\s+(\d{2}\.\d{2}\.\d{4})
DEBUG:invoice2data.extract.invoice_template:res_find=['21.01.2020', '21.01.2020', '21.01.2020']
DEBUG:invoice2data.extract.invoice_template:result of date parsing=2020-01-21 00:00:00
DEBUG:invoice2data.extract.invoice_template:field=kundennr | regexp=Kunden-Nr.\s+(\d{5})
DEBUG:invoice2data.extract.invoice_template:res_find=['12345', '12345', '12345']
DEBUG:invoice2data.extract.invoice_template:{'issuer': 'ALLNET GmbH Computersysteme', 'amount': 93.83, 'invoice_number': '1234567', 'date': datetime.datetime(2020, 1, 21, 0, 0), 'kundennr': '12345', 'currency': 'EUR', 'desc': 'Invoice from ALLNET GmbH Computersysteme'}
INFO:invoice2data.main:{'issuer': 'ALLNET GmbH Computersysteme', 'amount': 93.83, 'invoice_number': '1234567', 'date': datetime.datetime(2020, 1, 21, 0, 0), 'kundennr': '12345', 'currency': 'EUR', 'desc': 'Invoice from ALLNET GmbH Computersysteme'}
> cat invoices-output.xml 
<?xml version="1.0" ?>
<data>
  <item id="1">
    <issuer>ALLNET GmbH Computersysteme</issuer>
    <amount>93.83</amount>
    <invoice_number>1234567</invoice_number>
    <date>2020-01-21</date>
    <kundennr>12345</kundennr>
    <currency>EUR</currency>
    <desc>Invoice from ALLNET GmbH Computersysteme</desc>
  </item>
</data>
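
The bad escape \E error mentioned above can be reproduced outside invoice2data with the standard re module. Since Python 3.6, an unknown escape made of a backslash and an ASCII letter in a pattern is an error; as far as I know, Python 2 accepted it and treated \E as a literal E. A small sketch (the helper name compiles is only for illustration):

```python
import re

def compiles(pattern):
    """Return True if `pattern` is accepted by re.compile."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

original = r"Endsumme\s+\EUR\s+([0-9]+\,\d{2})"  # as written in the template
fixed = r"Endsumme\s+EUR\s+([0-9]+\,\d{2})"      # backslash before EUR removed

print(compiles(original))  # rejected on Python 3.6 and newer
print(compiles(fixed))

# The fixed pattern still captures the amount from a matching line.
match = re.search(fixed, "Endsumme   EUR   93,83")
print(match.group(1) if match else None)
```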

Can this be some problem with Python version or system settings (locale or similar)?
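
One detail in the two outputs may be relevant to that question, offered as an observation and not as a confirmed diagnosis: in the debug log from @christianroeser some values are printed with a u'...' prefix, which is how Python 2 prints unicode strings, and those are the same fields that come out empty in his XML. The sketch below only compares the log line and the XML quoted earlier in this thread. The helper names are made up, and it says nothing about how invoice2data's XML writer is implemented.

```python
import re
import xml.etree.ElementTree as ET

# INFO line payload and XML copied from the report earlier in this thread.
info_line = (
    "{'currency': 'EUR', 'amount': 93.83, "
    "'date': datetime.datetime(2020, 1, 21, 0, 0), "
    "'invoice_number': u'1234567', "
    "'desc': 'Invoice from ALLNET GmbH Computersysteme', "
    "'kundennr': u'12345', 'issuer': 'ALLNET GmbH Computersysteme'}"
)
xml_output = """
<data>
  <item id="1">
    <currency>EUR</currency>
    <amount>93.83</amount>
    <date>2020-01-21</date>
    <invoice_number/>
    <desc>Invoice from ALLNET GmbH Computersysteme</desc>
    <kundennr/>
    <issuer>ALLNET GmbH Computersysteme</issuer>
  </item>
</data>
""".strip()

def unicode_fields(line):
    """Names of fields whose value is printed with a u'...' prefix."""
    return re.findall(r"'(\w+)': u'", line)

def empty_fields(xml_text):
    """Names of child elements of <item> that carry no text."""
    item = ET.fromstring(xml_text).find("item")
    return [child.tag for child in item if not (child.text or "").strip()]

print(unicode_fields(info_line))
print(empty_fields(xml_output))
```

Both helpers return the same two field names, invoice_number and kundennr, which is why the Python version seems worth checking first.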

rmilecki avatar Sep 20 '20 12:09 rmilecki

I use 2.7.18. What is your version, @christianroeser?

rmilecki avatar Sep 20 '20 13:09 rmilecki

Hi, I've got a similar issue. I'm new to the project, which I downloaded yesterday. I'm parsing a pay slip. The field "ferie" is parsed as 200,00 as expected and is exported as expected in CSV mode, but it is missing in XML mode.

Logs and output files are attached. The command and template are below. Best regards.

log.zip

CSV_XML.zip

invoice2data --debug C:\Users\fpalmieri\Documents\test_cedolino_engineering_decrittato.PDF --output-format xml --output-name C:\Users\fpalmieri\Documents\test_cedolino_engineering.xml
pause

# -*- coding: utf-8 -*-
issuer: Engineering
fields:
  invoice_number: PERIODO\s+([0-9]{2}/[0-9]{4})
  amount: GG-Det.\s+([\.\,0-9]+)
  date: ([0-9]{2}\.[0-9]{2}\.[0-9]{4})\s+ARR\.MESE
  ferie: Spett.\s+([\.\,0-9]+)
keywords:
   - Engineering
required_fields:
   - invoice_number
   - amount
   - date
   - ferie
options:
  currency: EUR
  date_formats:
    - '%d.%m.%Y'
  languages:
    - it
  decimal_separator: ','

FrancescoPalmieri avatar Nov 28 '20 14:11 FrancescoPalmieri

Hi, I am also missing fields. Only the default fields are listed in the XML output (date, desc, currency, amount). Export to JSON is fine, but with an additional log message: <class 'module'> <module 'json' from '/usr/local/Cellar/python@3.9/3.9.2_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/__init__.py'>

gregoribic avatar Apr 05 '21 09:04 gregoribic

@FrancescoPalmieri: do you use 0.3.6 or newer?

I've prepared a sample PDF matching your template, see: cedolino.pdf

Parsing it works perfectly fine, see:

> invoice2data --debug --template-folder tpl --output-format xml cedolino.pdf
DEBUG:invoice2data.main:START pdftotext result ===========================
DEBUG:invoice2data.main:Engineering

PERIODO 11/2020
GG-Det. 2142.00
27.11.2020 ARR.MESE
Spettx 200,00


DEBUG:invoice2data.main:END pdftotext result =============================
DEBUG:invoice2data.main:Testing 178 template files
DEBUG:invoice2data.extract.invoice_template:Template: MyFlipkart.yml. Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template:Template: allnet.yml. Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template:Template: cedolino.yml. Keywords matched. No exclude keywords found.
DEBUG:invoice2data.extract.invoice_template:START optimized_str ========================
DEBUG:invoice2data.extract.invoice_template:Engineering

PERIODO 11/2020
GG-Det. 2142.00
27.11.2020 ARR.MESE
Spettx 200,00


DEBUG:invoice2data.extract.invoice_template:END optimized_str ==========================
DEBUG:invoice2data.extract.invoice_template:Date parsing: languages=['it'] date_formats=['%d.%m.%Y']
DEBUG:invoice2data.extract.invoice_template:Float parsing: decimal separator=,
DEBUG:invoice2data.extract.invoice_template:keywords=['Engineering']
DEBUG:invoice2data.extract.invoice_template:{'remove_whitespace': False, 'remove_accents': False, 'lowercase': False, 'currency': 'EUR', 'date_formats': ['%d.%m.%Y'], 'languages': ['it'], 'decimal_separator': ',', 'replace': []}
DEBUG:invoice2data.extract.invoice_template:field=invoice_number | regexp=PERIODO\s+([0-9]{2}/[0-9]{4})
DEBUG:invoice2data.extract.invoice_template:field=amount | regexp=GG-Det.\s+([\.\,0-9]+)
DEBUG:invoice2data.extract.invoice_template:field=date | regexp=([0-9]{2}\.[0-9]{2}\.[0-9]{4})\s+ARR\.MESE
DEBUG:invoice2data.extract.invoice_template:result of date parsing=2020-11-27 00:00:00
DEBUG:invoice2data.extract.invoice_template:field=ferie | regexp=Spett.\s+([\.\,0-9]+)
DEBUG:invoice2data.extract.invoice_template:{'issuer': 'Engineering', 'invoice_number': '11/2020', 'amount': 214200.0, 'date': datetime.datetime(2020, 11, 27, 0, 0), 'ferie': '200,00', 'currency': 'EUR', 'desc': 'Invoice from Engineering'}
INFO:invoice2data.main:{'issuer': 'Engineering', 'invoice_number': '11/2020', 'amount': 214200.0, 'date': datetime.datetime(2020, 11, 27, 0, 0), 'ferie': '200,00', 'currency': 'EUR', 'desc': 'Invoice from Engineering'}
> cat invoices-output.xml 
<?xml version="1.0" ?>
<data>
  <item id="1">
    <issuer>Engineering</issuer>
    <invoice_number>11/2020</invoice_number>
    <amount>214200.0</amount>
    <date>2020-11-27</date>
    <ferie>200,00</ferie>
    <currency>EUR</currency>
    <desc>Invoice from Engineering</desc>
  </item>
</data>

As you can see there is <ferie>200,00</ferie> in my XML.

I've tried Python 3.6 and 3.9. Both work fine.

I'm really out of ideas why some people can't see all fields in the XML output.
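
A side note on the sample run above: the amount is reported as 214200.0 although the PDF text reads 2142.00. That is consistent with the template's decimal_separator: ',', under which a dot is treated as a thousands separator. A minimal sketch of that behaviour, assuming a parser that drops every dot, comma and whitespace character except the configured decimal separator (the function below is hypothetical and not copied from invoice2data):

```python
import re

def parse_number(value, decimal_separator=","):
    """Hypothetical number parser honouring a configurable decimal separator."""
    if value.count(decimal_separator) > 1:
        raise ValueError("decimal separator appears more than once")
    # Protect the decimal separator, strip grouping characters,
    # then restore the separator as a dot for float().
    protected = value.replace(decimal_separator, "|")
    digits = re.sub(r"[.,\s]", "", protected)
    return float(digits.replace("|", "."))

print(parse_number("2142.00"))   # dot is dropped as a grouping character
print(parse_number("2.142,00"))  # comma is the decimal separator
print(parse_number("200,00"))
```

Under the same assumption, writing the amount as 2142,00 in the test PDF would give 2142.0. The ferie value stays the string '200,00' in the log above, so it does not appear to go through number parsing at all.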

rmilecki avatar Sep 12 '21 20:09 rmilecki

@gregoribic: did you try the 0.3.6 (or newer) release?

rmilecki avatar Sep 12 '21 20:09 rmilecki

I modified my clone. Will check the master branch.

gregoribic avatar Nov 13 '21 08:11 gregoribic

@gregoribic: can you provide an update on this, please?

rmilecki avatar Aug 29 '22 20:08 rmilecki