Parse Errors

Open flywire opened this issue 4 years ago • 13 comments

Done at https://github.com/marlanperumal/pdf_statement_reader/issues/33#issuecomment-865017265 (pdf at #30).

Date    Transaction                             Debit   Credit  Balance
01 Jul  2018 OPENING BALANCE                                    $1,384.89 CR
01 Jul  DEBIT INTEREST CHARGED on this account
        to June 30. 2018 is $0.11
02 Jul  Transfer to another Bank NetBank        372.00          $1,012.89 CR
        Rob Ubank Transfer
  1. ~Amount starting/ending/contains CR/DB~
  2. ~No year in date~
  3. ~Skip lines starting/ending/contains Balance/Forward~
  4. ~Concatenate wrapped Transaction description~
  5. ~Allow currency sign~
  6. ~Allow thousands separator~
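
A minimal sketch of how items 1, 5 and 6 could be handled together, assuming a trailing CR means a positive value and DB (or DR) a negative one. `parse_amount` is a hypothetical helper written for illustration, not a function in pdf_statement_reader.

```python
import re
from decimal import Decimal

# Hypothetical pattern: optional "$", digits with optional "," thousands
# separators, optional two-digit cents, optional CR/DR/DB suffix.
_AMOUNT = re.compile(r"^\$?(\d{1,3}(?:,\d{3})*|\d+)(\.\d{2})?\s*(CR|DR|DB)?$")


def parse_amount(text):
    """Return a signed Decimal, or None if the text is not a clean amount."""
    match = _AMOUNT.match(text.strip())
    if match is None:
        return None
    whole, cents, suffix = match.groups()
    value = Decimal(whole.replace(",", "") + (cents or ""))
    # Assumption: DR/DB marks a debit (negative) value, CR a credit.
    return -value if suffix in ("DR", "DB") else value
```

With the sample rows above, `parse_amount("$1,384.89 CR")` gives `Decimal("1384.89")`, and a value that does not fit the pattern returns `None` so the caller can decide whether to skip the row or raise.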

flywire avatar May 30 '21 13:05 flywire

Using https://github.com/marlanperumal/pdf_statement_reader/issues/34#issuecomment-860212042

(pdf_statement_reader) E:\venv\pdf_statement_reader>psr pdf2csv MockTemplate.pdf
Converted MockTemplate.pdf and saved as MockTemplate.csv
Date,Transaction Description,Transaction Detail,Debit Amount,Credit Amount,Balance
,"Lorem ipsum dolor sit amet,","consectetur adipiscing elit, sed",-197.60,,802.40
,tristique senectus. Quisque,,-976.14,,192.73
,egestas diam in arcu curs,euismod quis. Eget mi proin sed,-904.80,,-712.07
,libero enim sed. Dignissim enim,,,327.70,-831.74
,sit amet venenatis urna. Ipsum a,,,362.86,-468.88
,arcu cursus vitae congue mauris,,-455.41,,-924.29
,rhoncus aenean,,,169.89,-754.40
,consectetur adipiscing,,,462.19,-292.21
,arcu felis bibendum ut tristique,,,452.75,160.54
,et egestas. Sagittis vitae et leos,,,171.54,332.08
,duis ut diam. Tempor commo,,,237.17,569.25
,ullamcorper,sed arcu non. Facilisis volutpat,,392.18,961.43
,est velil egestas. Vel facilisi,,-786.68,,595.30
  1. Date dropped with format used
  2. Transactions with thousands separator in Balance dropped (terminates with exception error if thousands separator in Debit/Credit Amount)
  3. Where next transaction dropped, next Transaction Description shown in previous Transaction Detail
  4. Only first page processed

flywire avatar Jun 14 '21 00:06 flywire

Can you provide the full json config that you used? In the issue that you link, I only see the layout section

marlanperumal avatar Jun 14 '21 20:06 marlanperumal

(pdf_statement_reader) E:\venv\pdf_statement_reader>psr pdf2csv MockTemplate.pdf

[default: za.absa.cheque]

flywire avatar Jun 14 '21 22:06 flywire

Oh I see, did you just use the default config?

{
    "layout": {
        "default": {
            "area": [280, 27, 763, 576],
            "columns": [83, 264, 344, 425, 485, 570]
        },
        "first": {
            "area": [480, 27, 763, 576],
            "columns": [83, 264, 344, 425, 485, 570]
        }
    },
    "columns": {
        "trans_date": "Date",
        "trans_type": "Transaction Description",
        "trans_detail": "Transaction Detail",
        "debit": "Debit Amount",
        "credit": "Credit Amount",
        "balance": "Balance"
    },
    "order": [
        "trans_date",
        "trans_type",
        "trans_detail",
        "debit",
        "credit",
        "balance"
    ],
    "cleaning": {
        "numeric": ["debit", "credit", "balance"],
        "date": ["trans_date"],
        "date_format": "%d/%m/%Y",
        "trans_detail": "below",
        "dropna": ["balance"]
    }
}

You should create a new JSON config file updated with the correct values, store it in pdf_statement_reader/config/au/citi/mock.json or similar, and call it with:

psr pdf2csv MockTemplate.pdf --config=au.citi.mock
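
The path and the `--config` value above suggest that the dotted name maps onto the folder layout. A small sketch of that mapping follows; `config_path` is a hypothetical helper that only illustrates the naming convention and is not the package's actual loader.

```python
from pathlib import PurePosixPath


def config_path(name, root="pdf_statement_reader/config"):
    """Map a dotted config name to a JSON file path under the config root."""
    # Assumption: each dot-separated part is a folder, the last part the file.
    parts = name.split(".")
    return str(PurePosixPath(root, *parts).with_suffix(".json"))


print(config_path("au.citi.mock"))
```

Under this assumption the default `za.absa.cheque` would resolve to `pdf_statement_reader/config/za/absa/cheque.json`.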

marlanperumal avatar Jun 14 '21 22:06 marlanperumal

lol now I finally get it, we have a cross-communication issue.

For now the only config supported is for Cheque account statements from Absa bank in South Africa.

You are suggesting a new config file is required but I was trying to create a sample pdf to use with default setting.

Testing pdf(s) are also needed. Maybe pdf_statement_reader/config/00/test/mock.json

flywire avatar Jun 14 '21 23:06 flywire

Read MockTemplate.pdf date with "date_format": "%d/%m/%y" but saved in csv file as "yyyy/mm/dd"

flywire avatar Jun 16 '21 13:06 flywire

What do you think about changing JSON cleaning as follows?

  • "drop": ["trans_type", "(^(?i)BALANCE.*(?i)FORWARD$)"]

flywire avatar Jun 20 '21 09:06 flywire

> What do you think about changing JSON cleaning as follows?
>
>   • "drop": ["trans_type", "(^(?i)BALANCE.*(?i)FORWARD$)"]

To be honest I'd prefer to be explicit on the column names, also because of my previously mentioned aversion to regex. I might be persuaded to go for some additional keys like

  • drop_if_prefix
  • drop_if_suffix
  • drop_if_contains
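
A rough sketch of how such keys might behave, assuming each key maps a column name to a list of strings and that matching is case-insensitive. Neither the key layout nor `keep_row` exists in pdf_statement_reader; both are made up here for illustration. Unlike the regex proposal, this version drops a row when any single rule matches, not only when prefix and suffix match together.

```python
def keep_row(row, cleaning):
    """Return False if any hypothetical drop_if_* rule matches the row."""
    checks = {
        "drop_if_prefix": str.startswith,
        "drop_if_suffix": str.endswith,
        "drop_if_contains": str.__contains__,
    }
    for key, test in checks.items():
        for column, needles in cleaning.get(key, {}).items():
            value = (row.get(column) or "").upper()
            if any(test(value, needle.upper()) for needle in needles):
                return False
    return True


# Assumed config shape: {rule: {column: [strings]}}
cleaning = {
    "drop_if_prefix": {"trans_type": ["BALANCE"]},
    "drop_if_suffix": {"trans_type": ["FORWARD"]},
}
rows = [
    {"trans_type": "Balance Brought Forward", "balance": "1384.89"},
    {"trans_type": "Loan Repayment", "balance": "751.69"},
]
kept = [row for row in rows if keep_row(row, cleaning)]
```

Keeping the column name explicit in each rule avoids regex entirely, at the cost of not being able to express "starts with X and ends with Y" as one rule.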

marlanperumal avatar Jun 20 '21 10:06 marlanperumal

> Read MockTemplate.pdf date with "date_format": "%d/%m/%y" but saved in csv file as "yyyy/mm/dd"

Are you reading the csv file into Excel? (hence the format change). The desired behaviour is actually to always output the date in iso-format YYYY-MM-DD no matter what the input format.
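
As an illustration of that behaviour, the input format only controls parsing, and the output is always ISO. This is a sketch using the standard library, not the project's own cleaning code; `to_iso` is a hypothetical name.

```python
from datetime import datetime


def to_iso(date_text, date_format):
    """Parse with the configured input format, emit ISO YYYY-MM-DD."""
    return datetime.strptime(date_text, date_format).date().isoformat()


# Both the two-digit and four-digit year formats end up the same on output.
print(to_iso("01/07/18", "%d/%m/%y"))
print(to_iso("01/07/2018", "%d/%m/%Y"))
```

A spreadsheet that opens the resulting CSV may still redisplay the date in its own locale format, which would explain seeing "yyyy/mm/dd".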

marlanperumal avatar Jun 20 '21 10:06 marlanperumal

> lol now I finally get it, we have a cross-communication issue.
>
> For now the only config supported is for Cheque account statements from Absa bank in South Africa.
>
> You are suggesting a new config file is required but I was trying to create a sample pdf to use with default setting.
>
> Testing pdf(s) are also needed. Maybe pdf_statement_reader/config/00/test/mock.json

I'll try to create a config for your mock pdf later this week. It'll actually be a really good base for testing all the other functionality.

marlanperumal avatar Jun 20 '21 10:06 marlanperumal

> config > [country code] > [bank] > [statement type].json

The country code is irrelevant, as a bank's format is normally consistent across countries.

> explicit column names

The JSON file already uses coded field names, which are more portable because they are not specific to actual statements.

> The desired behaviour is actually to always output the date in iso-format YYYY-MM-DD no matter what the input format.

No, the output format is specified by the software consuming the file.

> I'll try to create a config for your mock pdf later this week

The only difference from the default is "date_format": "%d/%m/%Y" -> "date_format": "%d/%m/%y". Please do one for CbaBankStatement.pdf.

flywire avatar Jun 20 '21 23:06 flywire

CbaBankStatement.pdf is now fully supported, with extended functionality, on https://github.com/flywire/pdf_statement_reader/commits/cba.

CbaBankStatement.csv

Date,Transaction,Debit,Credit,Balance
01/07/18,2018 Opening Balance,,,1384.89
01/07/18,Debit Interest Charged On This Account To June 30. 2018 Is $0.11,,,
02/07/18,Transfer To Another Bank Netbank Rob Ubank Transfer,372.00,,1012.89
02/07/18,Transfer From Mckay Mj Mick Mckay - Neck Hackles.,,43.80,
02/07/18,Direct Debit 000115 Colonial Mutual 1200180874627741,25.00,,
02/07/18,Loan Repayment Ln Repay694259331,280.00,,751.69
03/07/18,Petstock Heathmont P Heathmont Aus Card Xx4521 Value Date: 01/07/2018,32.99,,718.70
03/07/18,Woolworths 3149 Eastla Ringwood Aus Card Xx4521 Value Date: 01/07/2018,134.26,,584.44
03/07/18,Heathmont Iga Heathmont Aus Card Xx4521 Value Date: 30/06/2018,30.00,,554.44
03/07/18,Post Heathmont Lpohe Heathmont Ca Aus Card Xx4521 Value Date: 29/06/2018,47.57,,506.87
03/07/18,Five Star Music Pl Ringwood Vi Aus Card Xx4521 Value Date: 28/06/2018,50.00,,456.87
03/07/18,Post Heathmont Lpohe Heathmont Ca Aus Card Xx4521 Value Date: 28/06/2018,55.65,,401.22

Note: The original PDF has two invalid balances because "." is used as both the decimal and the thousands separator in those values, so they are ignored.
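
One possible way to flag such values instead of silently ignoring them is sketched below. The two invalid balances are not shown in this thread, so the strings used here are invented examples, and `uses_dot_for_both` is a hypothetical helper.

```python
import re

# Matches e.g. "1.012.89": "." between thousands groups and before the cents.
_DOT_GROUPED = re.compile(
    r"^\$?\d{1,3}(?:\.\d{3})+\.\d{2}(?:\s*(?:CR|DR|DB))?$"
)


def uses_dot_for_both(text):
    """True if '.' appears as both thousands and decimal separator."""
    return _DOT_GROUPED.match(text.strip()) is not None


# Invented sample values, not taken from CbaBankStatement.pdf.
for sample in ("$1.012.89 CR", "$1,012.89 CR", "372.00"):
    print(sample, uses_dot_for_both(sample))
```

A parser could log the flagged rows so the user knows which balances were skipped.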

flywire avatar Jun 21 '21 13:06 flywire

The issue of misidentified data after the end of the transaction data remains, as previously described. I suggest supporting dropping all records after a user-specified string.
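
A sketch of what that could look like, assuming rows are dicts and that the row containing the marker is dropped along with everything after it. `truncate_at` and the sample marker text are hypothetical.

```python
from itertools import takewhile


def truncate_at(rows, column, marker):
    """Keep rows up to, but not including, the first row containing marker."""
    marker = marker.upper()
    return list(
        takewhile(
            lambda row: marker not in (row.get(column) or "").upper(),
            rows,
        )
    )


# Invented rows: the last two stand in for non-transaction text after the table.
rows = [
    {"trans_type": "Loan Repayment"},
    {"trans_type": "Closing Balance"},
    {"trans_type": "Transaction Summary"},
]
print(truncate_at(rows, "trans_type", "closing balance"))
```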

flywire avatar Jun 24 '21 13:06 flywire

Is this still an issue? I've updated the project (including migrating to uv for python and package management) and fixed a couple bugs along the way.

I'm picking up this project again after a bit of a hiatus. If I don't see any response on this issue within 2 weeks, it will be closed, so that only current issues will be present and can be addressed.

marlanperumal avatar Jan 16 '25 10:01 marlanperumal

Whatever

flywire avatar Jan 16 '25 11:01 flywire

Ok - closing this issue then

marlanperumal avatar Jan 24 '25 09:01 marlanperumal