pdf_statement_reader
Parse Errors
Done at https://github.com/marlanperumal/pdf_statement_reader/issues/33#issuecomment-865017265 (pdf at #30).
Date Transaction Debit Credit Balance
01 Jul 2018 OPENING BALANCE $1,384.89 CR
01 Jul DEBIT INTEREST CHARGED on this account
to June 30. 2018 is $0.11
02 Jul Transfer to another Bank NetBank 372.00 $1,012.89 CR
Rob Ubank Transfer
- ~Amount starting/ending/contains CR/DB~
- ~No year in date~
- ~Skip lines starting/ending/contains Balance/Forward~
- ~Concatenate wrapped Transaction description~
- ~Allow currency sign~
- ~Allow thousands separator~
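For anyone following along, the amount-related items in this list (currency sign, thousands separator, CR/DB marker) can be sketched in a few lines of Python. This is only an illustration and not the code from the linked comment: `parse_amount` is a hypothetical helper, and treating DR/DB as negative is an assumption.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_amount(text: str) -> Optional[Decimal]:
    """Parse strings such as '$1,384.89 CR' or '372.00' into a signed Decimal.

    Hypothetical helper: returns None when the text is not a clean amount.
    """
    cleaned = text.strip()
    if not cleaned:
        return None
    # Assumption: a DR or DB marker anywhere in the text means a debit (negative).
    is_debit = re.search(r"DR|DB", cleaned, flags=re.IGNORECASE) is not None
    # Strip the CR/DR/DB marker, the currency sign and the thousands separators.
    cleaned = re.sub(r"CR|DR|DB", "", cleaned, flags=re.IGNORECASE)
    cleaned = cleaned.replace("$", "").replace(",", "").strip()
    # Reject anything that is not a plain decimal number, e.g. '1.234.56'.
    if not re.fullmatch(r"-?\d+(\.\d+)?", cleaned):
        return None
    value = Decimal(cleaned)
    return -value if is_debit else value
```

With this sketch, a value that uses `.` as both the decimal and thousands separator is rejected rather than guessed at.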
Using https://github.com/marlanperumal/pdf_statement_reader/issues/34#issuecomment-860212042
(pdf_statement_reader) E:\venv\pdf_statement_reader>psr pdf2csv MockTemplate.pdf
Converted MockTemplate.pdf and saved as MockTemplate.csv
Date,Transaction Description,Transaction Detail,Debit Amount,Credit Amount,Balance
,"Lorem ipsum dolor sit amet,","consectetur adipiscing elit, sed",-197.60,,802.40
,tristique senectus. Quisque,,-976.14,,192.73
,egestas diam in arcu curs,euismod quis. Eget mi proin sed,-904.80,,-712.07
,libero enim sed. Dignissim enim,,,327.70,-831.74
,sit amet venenatis urna. Ipsum a,,,362.86,-468.88
,arcu cursus vitae congue mauris,,-455.41,,-924.29
,rhoncus aenean,,,169.89,-754.40
,consectetur adipiscing,,,462.19,-292.21
,arcu felis bibendum ut tristique,,,452.75,160.54
,et egestas. Sagittis vitae et leos,,,171.54,332.08
,duis ut diam. Tempor commo,,,237.17,569.25
,ullamcorper,sed arcu non. Facilisis volutpat,,392.18,961.43
,est velil egestas. Vel facilisi,,-786.68,,595.30
- Date dropped with format used
- Transactions with thousands separator in Balance dropped (terminates with exception error if thousands separator in Debit/Credit Amount)
- Where next transaction dropped, next Transaction Description shown in previous Transaction Detail
- Only first page processed
Can you provide the full JSON config that you used? In the issue you linked, I only see the layout section.
(pdf_statement_reader) E:\venv\pdf_statement_reader>psr pdf2csv MockTemplate.pdf
[default: za.absa.cheque]
Oh I see, did you just use the default config?
{
    "layout": {
        "default": {
            "area": [280, 27, 763, 576],
            "columns": [83, 264, 344, 425, 485, 570]
        },
        "first": {
            "area": [480, 27, 763, 576],
            "columns": [83, 264, 344, 425, 485, 570]
        }
    },
    "columns": {
        "trans_date": "Date",
        "trans_type": "Transaction Description",
        "trans_detail": "Transaction Detail",
        "debit": "Debit Amount",
        "credit": "Credit Amount",
        "balance": "Balance"
    },
    "order": [
        "trans_date",
        "trans_type",
        "trans_detail",
        "debit",
        "credit",
        "balance"
    ],
    "cleaning": {
        "numeric": ["debit", "credit", "balance"],
        "date": ["trans_date"],
        "date_format": "%d/%m/%Y",
        "trans_detail": "below",
        "dropna": ["balance"]
    }
}
You should create a new json config file, updated with the correct values, store it in pdf_statement_reader/config/au/citi/mock.json or similar and call it with
psr pdf2csv MockTemplate.pdf --config=au.citi.mock
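To make the naming convention concrete, here is a rough sketch of how a dotted config name like `au.citi.mock` lines up with a file under the config folder. `config_path` is a hypothetical function written for this comment, not the project's actual loader, and the default root is an assumption based on the path mentioned above.

```python
from pathlib import Path


def config_path(config_name: str,
                config_root: Path = Path("pdf_statement_reader/config")) -> Path:
    """Map a dotted config name onto a JSON file path (hypothetical helper)."""
    parts = config_name.split(".")
    # Every part but the last is a folder; the last part is the file stem.
    return config_root.joinpath(*parts[:-1], parts[-1] + ".json")


print(config_path("au.citi.mock").as_posix())
```

The same mapping would put the default `za.absa.cheque` config at `config/za/absa/cheque.json` under that root.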
lol now I finally get it, we have a cross-communication issue.
For now the only config supported is for Cheque account statements from Absa bank in South Africa.
You are suggesting a new config file is required but I was trying to create a sample pdf to use with default setting.
Testing pdf(s) are also needed. Maybe pdf_statement_reader/config/00/test/mock.json
Read MockTemplate.pdf date with "date_format": "%d/%m/%y" but saved in csv file as "yyyy/mm/dd"
What do you think about changing JSON cleaning as follows?
"drop": ["trans_type", "(^(?i)BALANCE.*(?i)FORWARD$)"]
> What do you think about changing JSON cleaning as follows?
> "drop": ["trans_type", "(^(?i)BALANCE.*(?i)FORWARD$)"]
To be honest I'd prefer to be explicit on the column names, also because of my previously mentioned aversion to regex. I might be persuaded to go for some additional keys like
drop_if_prefix, drop_if_suffix, drop_if_contains
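A minimal sketch of how such explicit keys could behave, without any regex. The config shape (each key mapping a column name to a list of strings) and the `should_drop` helper are assumptions made up for illustration; none of this exists in the project as far as this thread shows. As an aside, Python 3.11 and later only accept a global inline flag such as `(?i)` at the very start of a pattern, so the regex proposed above would need adjusting anyway.

```python
from typing import Callable, Dict, List


def should_drop(row: Dict[str, str],
                cleaning: Dict[str, Dict[str, List[str]]]) -> bool:
    """Return True if any drop_if_* rule matches the row (case-insensitive)."""
    tests: Dict[str, Callable[[str, str], bool]] = {
        "drop_if_prefix": lambda value, text: value.startswith(text),
        "drop_if_suffix": lambda value, text: value.endswith(text),
        "drop_if_contains": lambda value, text: text in value,
    }
    for key, test in tests.items():
        # Assumed shape: {"drop_if_prefix": {"trans_type": ["Balance"]}, ...}
        for column, texts in cleaning.get(key, {}).items():
            value = row.get(column, "").strip().lower()
            if any(test(value, text.lower()) for text in texts):
                return True
    return False


cleaning = {
    "drop_if_prefix": {"trans_type": ["Balance"]},
    "drop_if_suffix": {"trans_type": ["Forward"]},
}
rows = [
    {"trans_type": "BALANCE BROUGHT FORWARD", "balance": "1012.89"},
    {"trans_type": "Transfer To Another Bank", "balance": "640.89"},
]
kept = [row for row in rows if not should_drop(row, cleaning)]
```

Here `kept` retains only the transfer row, because the first row matches the prefix rule.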
> Read MockTemplate.pdf date with "date_format": "%d/%m/%y" but saved in csv file as "yyyy/mm/dd"
Are you reading the csv file into Excel? (hence the format change). The desired behaviour is actually to always output the date in iso-format YYYY-MM-DD no matter what the input format.
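For reference, the ISO output described here can be sketched with the standard library alone. `to_iso_date` is a hypothetical name used only for this example, and the project's own date cleaning may work differently.

```python
from datetime import datetime


def to_iso_date(text: str, date_format: str) -> str:
    """Parse a date using the configured input format and return YYYY-MM-DD."""
    return datetime.strptime(text, date_format).date().isoformat()


# Two-digit and four-digit years both end up in the same ISO form.
short_year = to_iso_date("01/07/18", "%d/%m/%y")
long_year = to_iso_date("01/07/2018", "%d/%m/%Y")
```

Note that `strptime` raises `ValueError` when the text does not match the format, for example a two-digit year parsed with `%Y`, so the `date_format` in the config has to match the statement exactly.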
> lol now I finally get it, we have a cross-communication issue.
> For now the only config supported is for Cheque account statements from Absa bank in South Africa.
> You are suggesting a new config file is required but I was trying to create a sample pdf to use with default setting.
> Testing pdf(s) are also needed. Maybe pdf_statement_reader/config/00/test/mock.json
I'll try to create a config for your mock pdf later this week. It'll actually be a really good base to use for testing all the other functionality.
> config > [country code] > [bank] > [statement type].json
Country code is irrelevant, as a bank's format is normally consistent across countries.
> explicit column names
The json file already uses coded field names, which are more portable because they are not specific to actual statements.
> The desired behaviour is actually to always output the date in iso-format YYYY-MM-DD no matter what the input format.
No, the output format is specified by the software consuming the file.
> I'll try to create a config for your mock pdf later this week
The only difference from the default is "date_format": "%d/%m/%Y" -> "date_format": "%d/%m/%y". Do one for CbaBankStatement.pdf.
CbaBankStatement.pdf fully supported on https://github.com/flywire/pdf_statement_reader/commits/cba with extended functionality.
CbaBankStatement.csv
Date,Transaction,Debit,Credit,Balance
01/07/18,2018 Opening Balance,,,1384.89
01/07/18,Debit Interest Charged On This Account To June 30. 2018 Is $0.11,,,
02/07/18,Transfer To Another Bank Netbank Rob Ubank Transfer,372.00,,1012.89
02/07/18,Transfer From Mckay Mj Mick Mckay - Neck Hackles.,,43.80,
02/07/18,Direct Debit 000115 Colonial Mutual 1200180874627741,25.00,,
02/07/18,Loan Repayment Ln Repay694259331,280.00,,751.69
03/07/18,Petstock Heathmont P Heathmont Aus Card Xx4521 Value Date: 01/07/2018,32.99,,718.70
03/07/18,Woolworths 3149 Eastla Ringwood Aus Card Xx4521 Value Date: 01/07/2018,134.26,,584.44
03/07/18,Heathmont Iga Heathmont Aus Card Xx4521 Value Date: 30/06/2018,30.00,,554.44
03/07/18,Post Heathmont Lpohe Heathmont Ca Aus Card Xx4521 Value Date: 29/06/2018,47.57,,506.87
03/07/18,Five Star Music Pl Ringwood Vi Aus Card Xx4521 Value Date: 28/06/2018,50.00,,456.87
03/07/18,Post Heathmont Lpohe Heathmont Ca Aus Card Xx4521 Value Date: 28/06/2018,55.65,,401.22
Note: The original pdf file has 2 invalid balances because . is used as both the decimal and the thousands separator in those values, so they are ignored.
An issue remains with misidentified data after the end of the transaction data, as previously described. I suggest supporting dropping all records after a user-specified string.
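A rough sketch of the "drop everything after a marker" idea, in case it helps the discussion. `truncate_at_marker` and the "Closing Balance" marker text are made up for illustration. Nothing like this is confirmed to exist in the project, and this version also drops the marker row itself.

```python
from typing import Dict, List


def truncate_at_marker(rows: List[Dict[str, str]],
                       column: str,
                       marker: str) -> List[Dict[str, str]]:
    """Keep rows up to, but not including, the first row containing the marker."""
    kept: List[Dict[str, str]] = []
    for row in rows:
        # Case-insensitive match on the chosen column ends the transaction data.
        if marker.lower() in row.get(column, "").lower():
            break
        kept.append(row)
    return kept


rows = [
    {"trans_type": "Loan Repayment Ln Repay694259331"},
    {"trans_type": "Closing Balance"},  # made-up marker row
    {"trans_type": "Some footer text misread as a transaction"},
]
transactions = truncate_at_marker(rows, "trans_type", "closing balance")
```

In a config this could be a single string under the cleaning section, but the key name and placement would be up to the maintainer.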
Is this still an issue? I've updated the project (including migrating to uv for python and package management) and fixed a couple bugs along the way.
I'm picking up this project again after a bit of a hiatus. If I don't see any response on this issue within 2 weeks, it will be closed, so that only current issues will be present and can be addressed.
Whatever
Ok - closing this issue then