selective ocr to extract key/value data

Open raveslave opened this issue 4 years ago • 11 comments

Wouldn't it be cool to offer this feature? Basically, let the user draw an overlay that helps find key/value pairs, which can then be mapped to the relevant document type -> field in ERPNext!

image

raveslave avatar Nov 23 '19 12:11 raveslave

Hi @raveslave ,

Thanks for sharing this idea. It's really interesting and looks really cool.

I have a few doubts though about the usability as it seems rather complicated to develop or even to use.

  1. First, relying on the position of text seems like it would break easily if the format or the invoice "source" changes (shops do not always use the same invoice layout, and a layout may change over time).
  2. You would also have to manually map every text section to a DocType field each time you read a document. This would make bulk imports impossible. Maybe the previous mappings could be stored to "guide" users on recurring invoices (roughly the same position as on an older one), as you mentioned, but that could mean storing and reading a lot of data to provide this.
  3. This is hardly applicable to multi-page PDF documents.

Though it seems hard to provide this, we will still take a look at it as this definitely goes in the direction we are aiming: importing / generating DocTypes from OCR.

For reference, we're currently more invested in a text-based import using simple regular expressions or text processing libraries: https://appliedmachinelearning.blog/2018/06/30/performing-ocr-by-running-parallel-instances-of-tesseract-4-0-python/
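To make the text-based approach concrete, here is a minimal sketch of pulling fields out of already-OCR'd text with regular expressions. It is only an illustration: the `FIELD_PATTERNS` table, the field names, and the `extract_fields` helper are hypothetical and are not part of erpnext_ocr, and real invoices would need their own patterns.

```python
import re

# Hypothetical patterns for illustration only; each supplier's invoices
# will likely need different ones.
FIELD_PATTERNS = {
    "invoice_no": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-/]+)",
        re.IGNORECASE,
    ),
    "date": re.compile(
        r"date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2}|\d{1,2}[./]\d{1,2}[./]\d{2,4})",
        re.IGNORECASE,
    ),
}


def extract_fields(text):
    """Return a dict of field name -> first matched value, or None if not found."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    sample = "ACME Ltd\nInvoice No: INV-2019-0042\nDate: 2019-11-23\nTotal: 120.00 EUR"
    print(extract_fields(sample))
```

The text passed in would come from whatever OCR step is used; nothing here depends on the position of the text on the page.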

We can keep this open to discuss further if you want to.

madmath03 avatar Nov 25 '19 17:11 madmath03

pls see comments:

  1. true, but the idea is not to rely on the rectangle position, but rather to have a template that teaches the OCR tool to look for that same string. If it fails on a mandatory one, the script should notify that manual attention is needed.

  2. My idea is that you only do this mapping once per supplier. Disregarding OCR, most invoices will be PDF, so in that case the same principle would apply, but it would be easier to implement.

  3. true, but most of the time, the thing you're after is the date & invoice-no to allow populating the bare minimum (mandatory fields) and later matching it to a PO
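
The three points above could be sketched roughly as follows: one template per supplier that stores the label string to look for (not a position), plus a set of mandatory fields that trigger a "needs manual attention" result when they are not found. All names here (`SupplierTemplate`, `apply_template`, the field names) are hypothetical and only illustrate the idea; this is not existing erpnext_ocr code.

```python
from dataclasses import dataclass, field


@dataclass
class SupplierTemplate:
    """Hypothetical per-supplier template: created once, reused for every invoice."""
    supplier: str
    labels: dict  # field name -> label string as printed on the invoice
    mandatory: set = field(default_factory=set)


def apply_template(template, text):
    """Look up each label in the extracted text, independent of its position.

    Returns (values, missing), where `missing` lists mandatory fields
    that could not be found and therefore need manual attention.
    """
    values = {}
    for field_name, label in template.labels.items():
        values[field_name] = None
        for line in text.splitlines():
            pos = line.lower().find(label.lower())
            if pos != -1:
                # Take whatever follows the label on the same line.
                value = line[pos + len(label):].strip(" :\t")
                values[field_name] = value or None
                break
    missing = sorted(f for f in template.mandatory if values.get(f) is None)
    return values, missing


if __name__ == "__main__":
    template = SupplierTemplate(
        supplier="ACME",
        labels={"invoice_no": "Invoice No", "date": "Date", "po": "PO Ref"},
        mandatory={"invoice_no", "date"},
    )
    values, missing = apply_template(template, "ACME Ltd\nInvoice No: INV-42\n")
    if missing:
        print("Manual attention needed for:", missing)
```

Since only the date and invoice number are mandatory in this sketch, an invoice missing the PO reference would still pass and could be matched to a PO later.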

raveslave avatar Nov 25 '19 22:11 raveslave

re: tesseract
cool tech, have you tried it on a pile of random invoices? curious how it works and if there are ways to get parameterized data back from it.

raveslave avatar Nov 25 '19 22:11 raveslave

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 24 '20 23:01 stale[bot]

anyone been looking into this lately?

raveslave avatar Mar 11 '21 21:03 raveslave

Hello,

I am currently looking for something like this to use with ERPNext: converting scanned or email-received PDF purchase invoices to text (or even JSON) and, with the needed data, automatically creating a purchase invoice in ERPNext, with the added ability to upload the PDF files from the email and attach them (or a link) to the relevant purchase invoice. I'm not a programmer; I'm on the financial side. There are commercial solutions available for this functionality, which means it is possible to create.

gio3166 avatar Apr 12 '21 12:04 gio3166

> anyone been looking into this lately?

Hi @raveslave, unfortunately, we did not find the time to look any further into this.

madmath03 avatar Apr 12 '21 17:04 madmath03

just checking in, anyone willing to co-sponsor?

raveslave avatar Sep 18 '21 14:09 raveslave

I need to extract key value pairs from PDF tables

bharath-kumarn avatar Oct 12 '21 09:10 bharath-kumarn

@raveslave I need to extract key value pairs from PDF tables

bharath-kumarn avatar Oct 12 '21 09:10 bharath-kumarn