Trouble getting lookahead and lookbehind to work in the amount regex

Open AboveTheHeavens opened this issue 5 years ago • 3 comments

So, I'm trying to extract the amount from the invoice but I need to use look ahead to get the correct amount.

Here is the expression I'm using: (?<=\w:\s)[\d+\.]{0,}\d+,\d*(?=\s)

It's supposed to match the 9,95 in something like: GESAMT: 9,95

I've tested it online at regex101 and it works properly there (I used the Python flavor while testing).

But I keep getting a "regexp for field amount didn't match" warning.

Can anyone tell me what I can do to fix it? If not, could you at least let me know whether the problem is with my regex or with the library?

AboveTheHeavens avatar May 22 '20 01:05 AboveTheHeavens

Templates can be set to remove all whitespace because it generally makes matching more reliable. Maybe that's related to your issue?

Your regex also looks needlessly complicated. I'd start by simplifying it a bit and looking at the debug output from invoice2data to see the actual extracted text.

m3nu avatar May 22 '20 02:05 m3nu
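To illustrate the whitespace point, here is a minimal sketch in plain Python. It does not call invoice2data. The sample text is hypothetical, and the `re.sub` call is only an assumed stand-in for what a whitespace-stripping template setting would do to the extracted text. It shows that the original pattern depends on whitespace before and after the number, so it stops matching once that whitespace is gone.

```python
import re

# The pattern from the original question.
pattern = re.compile(r"(?<=\w:\s)[\d+\.]{0,}\d+,\d*(?=\s)")

# Hypothetical extracted text; the real text from the PDF may differ.
raw_text = "GESAMT: 9,95\n"

# Assumed stand-in for a template that strips all whitespace before matching.
stripped_text = re.sub(r"\s+", "", raw_text)

match_raw = pattern.search(raw_text)
match_stripped = pattern.search(stripped_text)

# With whitespace intact, the lookbehind and lookahead are both satisfied.
print(match_raw.group(0) if match_raw else None)

# Without whitespace, neither (?<=\w:\s) nor (?=\s) can succeed.
print(match_stripped)
```

If the debug output shows the text without spaces, dropping the `\s` requirements from the lookarounds (or using a capture group instead) is one possible way around this.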

@m3nu I saw your reply a bit late and ended up writing custom parsing for that (using pdfplumber and finding the string manually).

I know it's not directly related to my problem, but if you don't mind telling me, how would I access the debug output?

AboveTheHeavens avatar May 22 '20 05:05 AboveTheHeavens

Run the command below in the command prompt, replacing my_invoice.pdf with the name of your invoice.

invoice2data --debug my_invoice.pdf

C-Maxim avatar May 22 '20 14:05 C-Maxim

Thanks @C-Maxim for pointing out the --debug option.

As for the regex for your case, I'd suggest something much simpler, like:

amount: GESAMT:\s*(\d+,\d+)\s*€

rmilecki avatar Jan 22 '23 21:01 rmilecki
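The simpler pattern suggested above can be checked outside invoice2data with Python's `re` module. This is only a sketch: the sample strings are made up, and it assumes the invoice prints a € sign after the amount, which the example in the original question does not show.

```python
import re

# The simpler pattern from the suggestion above; group 1 holds the amount.
amount_re = re.compile(r"GESAMT:\s*(\d+,\d+)\s*€")

# Hypothetical lines as they might appear with and without whitespace.
samples = [
    "GESAMT: 9,95 €",  # whitespace kept
    "GESAMT:9,95€",    # whitespace stripped
    "GESAMT: 9,95",    # no currency sign
]

amounts = []
for line in samples:
    m = amount_re.search(line)
    amounts.append(m.group(1) if m else None)

print(amounts)
```

Because every `\s*` is optional, the pattern matches whether or not whitespace has been removed. It does require the € sign, so the last sample yields no match. If the invoice text has no currency sign, the trailing `\s*€` would have to be dropped.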