invoice2data icon indicating copy to clipboard operation
invoice2data copied to clipboard

Added area support for fields

Open RossK1 opened this issue 3 years ago • 27 comments

Fields specifying a parser now also accept an "area" argument. This allows one to specify the page(s) and coordinates of the area to be extracted and then applies the specified parser to the results as normal :) Currently this only works for pdftotext but could be expanded to other converters fairly easily. Incorporated changes into all existing functions where possible to allow for ease of future maintainability. Tutorial was updated to show new option and its syntax.

Credit for the idea and base code goes to @kavinsharma in issue #305

RossK1 avatar Mar 17 '21 04:03 RossK1

@kavinsharma, I hope this does justice to your original idea :) Based on some of the comments on #305 I switched it into the parsers to allow people to extract areas and still be able to parse them using the other functionalities. @m3nu, let me know what you think :)

RossK1 avatar Mar 17 '21 04:03 RossK1

This should probably be a parser, if it only finds a single value in a specific area.

Difference between parser and plugin: https://github.com/invoice-x/invoice2data/blob/master/src/invoice2data/extract/plugins/interface.py#L4

m3nu avatar Mar 17 '21 04:03 m3nu

The code in this PR is not a parser currently, since it's not one file in the parsers package.

We added parsers and plugins to avoid cluttering the rest of the code too much.

Hope you guys can work together on this. Personally, I'd build on #305, since it's more neatly separated. I'll add some comments there.

m3nu avatar Mar 17 '21 05:03 m3nu

@m3nu My idea with having it as an argument rather than a plugin or separate parser is that it could be used in conjunction with any of the other parsers. E.g. if you need to pull out lines from a specific region, you can specify that area and the lines parser. If you need a single value, specify the area and the regex parser. This way any future parsers inherit the functionality automatically as the area extraction is checked before the parser is called :) It is also completely optional, so any existing templates without the area specified work as normal and if you want to search the entire document, you can just leave it out. Thoughts?

RossK1 avatar Mar 17 '21 05:03 RossK1

ok sure @m3nu , we will look at your comments and fix it up. @RossK1 I hope you will be fine in contributing to my PR #305 :)

kavinsharma avatar Mar 17 '21 05:03 kavinsharma

ok sure @m3nu , we will look at your comments and fix it up. @RossK1 I hope you will be fine in contributing to my PR #305 :)

For sure! If you need me to test or check something in it, let me know 😄 Even if #305 is implemented as a plugin, I think the functionality in #333 is still important to have as it allows the parsers to (optionally) be directed to act on a very specific area of the invoice. I have found this to be required when dealing with invoices where multiple pieces of information are presented beside each other on the page, but move in vertical relation to each other, causing pdftotext to return slightly mis-aligned lines for each pdf. Being able to clip out the part you want and still run regex with sums or lines on it is invaluable.

RossK1 avatar Mar 17 '21 12:03 RossK1

@m3nu, what do you think? Would this be a useful feature? You'll see from the extra code that if the "area" parameter is passed in a field, it simply re-runs pdftotext for the region specified and uses the extracted text for that specific parser. Other fields and other plugins are unaffected and yet this functionality can benefit any current and future parsers meaning contributors can focus on the parser instead of how to extract an area of the page.

RossK1 avatar Mar 18 '21 01:03 RossK1

@m3nu, what do you think? Would this be a useful feature? You'll see from the extra code that if the "area" parameter is passed in a field, it simply re-runs pdftotext for the region specified and uses the extracted text for that specific parser. Other fields and other plugins are unaffected and yet this functionality can benefit any current and future parsers meaning contributors can focus on the parser instead of how to extract an area of the page.

is it available for usage?if is how can we use ?

erkin98 avatar Mar 22 '21 13:03 erkin98

@m3nu any update on reviewing and merging this feature?

RossK1 avatar Apr 08 '21 04:04 RossK1

Didn't you want to make this as parser? It also needs a rebase.

m3nu avatar Apr 08 '21 05:04 m3nu

If we make it a separate parser, it loses the ability to leverage the other parsers. Being an optional input to any field, it simply narrows down the content being sent to any other parser. This way the other parsers do not need their own area support, it is handled up-front.

RossK1 avatar Apr 08 '21 14:04 RossK1

True, that's an argument for adding it on top and I tend to agree. Let me consider it again tomorrow.

m3nu avatar Apr 08 '21 15:04 m3nu

Yeah, as much as I would like to avoid cluttering up the code, this seems like a more elegant way of doing it without having every parser have to call some other area parser first.

RossK1 avatar Apr 08 '21 17:04 RossK1

@m3nu . Any updates on this PR?

Thanks!

RossK1 avatar May 31 '21 17:05 RossK1

I have this same requirement. Any update on this PR?

pranavshah01 avatar Jun 18 '21 01:06 pranavshah01

I would have preferred this as standalone parser, but see the tight linkage with pdftotext. So the solution you chose makes sense.

Please just add a few test cases using the existing (or new) sample files to make sure the feature keeps working in the future.

Then we can merge and release this feature.

m3nu avatar Jun 18 '21 05:06 m3nu

Any chance to have this feature available soon? I'm using invoice2data package and a feature like this would be very helpful. Cheers!

sergiuturus avatar Aug 26 '21 11:08 sergiuturus

@sergiuturus not yet. Been trying to find time to add the test cases m3nu requires.

RossK1 avatar Aug 26 '21 12:08 RossK1

Yes, some tests please. So we will know if this feature breaks in the future.

m3nu avatar Sep 13 '21 07:09 m3nu

And please close any competing/superseded area plugin PRs. We had a few of them.

m3nu avatar Sep 13 '21 07:09 m3nu

is there any update on this?

alexx-avdeev avatar Jan 08 '22 13:01 alexx-avdeev

  • [ ] Rebase and resolve conflicts
  • [ ] Add tests and test files (or reuse existing files)
  • [ ] Close competing or superseded PRs

m3nu avatar Jan 08 '22 13:01 m3nu

@m3nu For the tests, would it be sufficient to simply reuse the Amazon one? Update the template file (src/invoice2data/extract/templates/com/com.amazon.aws.yml) to include an area and then update the expected result file (tests/compare/AmazonWebServices.json) to reflect the area to be extracted? The test_content_json in tests/test_cli.py should pick up and compare the changes, right? Would this one test be enough or would you like to see similar extractions for multiple files? Thanks :)

RossK1 avatar Jan 09 '22 21:01 RossK1

Sure. No problem to reuse an existing template if you don't want to share your own documents.

The test should run the new code. If you reuse a document, please add a new test and comparison output, so the old test is still there.

m3nu avatar Jan 10 '22 08:01 m3nu

Some conflicts here.

m3nu avatar Jun 23 '22 13:06 m3nu

Hi! Any updates on this?

Thanks

Logic-Bits avatar Aug 30 '22 09:08 Logic-Bits

  • [ ] Rebase and resolve conflicts
  • [ ] Add tests and test files (or reuse existing files)
  • [ ] Close competing or superseded PRs

@RossK1 This PR has been sitting here for a while. Would be great if it could be included in the next release. Could you attend to the outstanding tasks?

bosd avatar Sep 25 '22 07:09 bosd

@alexis-via Great to see you back here! @m3nu Since there are approving reviews, can we still include this in the 0.4.0 release?

bosd avatar Nov 28 '22 21:11 bosd

As the author of this PR didn't answer messages for several months, I rebased and added a test in this new PR: https://github.com/invoice-x/invoice2data/pull/438

alexis-via avatar Nov 28 '22 23:11 alexis-via

Ok, this one can be closed now. Thanks guys, for all your efforts!!

bosd avatar Nov 29 '22 06:11 bosd