ValueError when parsing number including currency symbol ($)
I have a source document where positive numbers are written as $1,200.00 and negative numbers as -$1,716.96.
For positive numbers, the field regex can ignore the $ sign, and the resulting value parses correctly. However, to pick up the minus sign, I need to capture the $ too, which causes the parser to fail.
As a stopgap, I've used a replace configuration to swap $ for `` (an empty string), but it would be better if we could either 1) concatenate capture groups automatically or 2) handle the currency symbol while parsing numbers.
What is the preferred approach?
DEBUG:root:field=amount_due | regexp=AMOUNT DUE:\s+(-?\$\d+?,?\d+\.\d+)
DEBUG:root:res_find=['-$1,716.96', '-$1,716.96']
Traceback (most recent call last):
  File "/home/user/.virtualenvs/pdftest/bin/invoice2data", line 10, in <module>
    sys.exit(main())
  File "/home/user/.virtualenvs/pdftest/lib/python3.6/site-packages/invoice2data/main.py", line 194, in main
    res = extract_data(f.name, templates=templates, input_module=input_module)
  File "/home/user/.virtualenvs/pdftest/lib/python3.6/site-packages/invoice2data/main.py", line 93, in extract_data
    return t.extract(optimized_str)
  File "/home/user/.virtualenvs/pdftest/lib/python3.6/site-packages/invoice2data/extract/invoice_template.py", line 186, in extract
    output[k] = self.parse_number(res_find[0])
  File "/home/user/.virtualenvs/pdftest/lib/python3.6/site-packages/invoice2data/extract/invoice_template.py", line 106, in parse_number
    return float(amount_pipe_no_thousand_sep.replace('|', '.'))
ValueError: could not convert string to float: '-$1716.96'
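For reference, the failure is easy to reproduce outside invoice2data, since `float()` rejects any string that still contains `$`. Below is a minimal sketch of option 2, handling the symbol during number parsing. The helper `parse_amount` and its parameters are hypothetical and are not part of invoice2data; it only illustrates the idea of stripping the currency symbol before conversion.

```python
def parse_amount(text, currency_symbols="$", thousands_sep=",", decimal_sep="."):
    """Hypothetical helper: turn a string such as '-$1,716.96' into a float."""
    cleaned = text.strip()
    # Remove every currency symbol, leaving the sign, digits and separators.
    for symbol in currency_symbols:
        cleaned = cleaned.replace(symbol, "")
    # Drop thousands separators, then normalise the decimal separator to '.'.
    cleaned = cleaned.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


print(parse_amount("-$1,716.96"))  # -1716.96
print(parse_amount("$1,200.00"))   # 1200.0
```

The set of symbols to strip would probably need to be configurable per template, since other documents may use different currencies or separators.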
You could add a custom field that picks up only the minus sign and merge it with the amount later.
Maybe we could allow the regex parser to accept multiple capturing groups and let the template specify how to handle them?
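A rough sketch of that workaround, written as plain post-processing on the extracted text. The function `extract_amount_due` is made up for illustration and is not an invoice2data feature; it assumes the sign and the digits are captured by two separate patterns and joined afterwards.

```python
import re


def extract_amount_due(text):
    """Hypothetical: capture sign and digits separately, then merge them."""
    # First pattern captures only the optional minus sign before the '$'.
    sign_match = re.search(r"AMOUNT DUE:\s+(-?)\$", text)
    # Second pattern skips the sign and '$' and captures only the digits.
    value_match = re.search(r"AMOUNT DUE:\s+-?\$(\d+?,?\d+\.\d+)", text)
    if sign_match is None or value_match is None:
        return None
    # Merge the two pieces and drop the thousands separator before converting.
    merged = sign_match.group(1) + value_match.group(1).replace(",", "")
    return float(merged)


print(extract_amount_due("AMOUNT DUE:   -$1,716.96"))  # -1716.96
print(extract_amount_due("AMOUNT DUE:   $1,200.00"))   # 1200.0
```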
Something like:
amount:
  parser: regex
  regex: AMOUNT DUE:\s+(-?)\$(\d+?,?\d+\.\d+)
  group: concat
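To make the proposal concrete, here is a sketch of how a `group: concat` option might behave. Nothing in it exists in invoice2data today; `extract_concat` is a hypothetical function that joins all capture groups of each match into a single string.

```python
import re


def extract_concat(regex, text):
    """Hypothetical 'group: concat' behaviour: join the capture groups of each match."""
    results = []
    for match in re.finditer(regex, text):
        # groups() yields one entry per capture group; a group that did not
        # participate in the match is None, so treat it as an empty string.
        results.append("".join(group or "" for group in match.groups()))
    return results


pattern = r"AMOUNT DUE:\s+(-?)\$(\d+?,?\d+\.\d+)"
print(extract_concat(pattern, "AMOUNT DUE:  -$1,716.96"))  # ['-1,716.96']
```

The joined string no longer contains `$`, so it could presumably be handed to the existing number parsing, which (judging by the traceback above) already strips the thousands separator before calling `float()`.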