
Convert invoice PDFs to images, read the text from those images, then pull invoice information such as the invoice number or vendor name from that text

python-OCR

In accounting, working with thousands of vendors makes it challenging to search scanned documents for an invoice by its invoice number.

Invoices contain a variety of information such as product names, VAT, product prices, vendor or customer names, tax information, and the date of the transaction. The process of reading text from images is called Optical Character Recognition (OCR): software recognizes the characters in an image and converts them into machine-readable text.
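Once a page has been turned into plain text, fields like the ones listed above can often be picked out with regular expressions. The following is only a minimal sketch and is not code from this repository: the patterns, the `extract_fields` helper and the sample text are all hypothetical, and real invoices vary enough between vendors that the patterns would need tuning.

```python
import re

# Hypothetical patterns for illustration; adjust them to your vendors' layouts.
INVOICE_NUMBER_RE = re.compile(
    r"invoice\s*(?:number|no\.?|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]*)",
    re.IGNORECASE,
)
DATE_RE = re.compile(r"\b(\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})\b")


def extract_fields(text):
    """Return the first invoice number and date found in OCR text, or None."""
    number = INVOICE_NUMBER_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        "date": date.group(1) if date else None,
    }


# Made-up sample text standing in for OCR output.
sample = "ACME Supplies\nInvoice No: INV-2023-0042\nDate: 14/03/2023\nVAT 20%"
print(extract_fields(sample))
# {'invoice_number': 'INV-2023-0042', 'date': '14/03/2023'}
```

OCR output is noisy (for example `0` read as `O`), so in practice a lookup like this usually needs some tolerance or a validation step against known invoice number formats.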

In this repository, I have gone through some ways to convert PDFs to images using Python. Then, we can read text from these images. Further content extraction is not provided here.
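As a rough illustration of the two steps (PDF to images, then images to text), here is a sketch that drives the Ghostscript and Tesseract command-line tools from Python's standard library. It is not necessarily how the code in this repository does it. The function names, the `page-%03d.png` file pattern, the 300 dpi default and the `gs` executable name are assumptions (on Windows the Ghostscript console binary is typically `gswin64c`).

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, dpi=300, gs_binary="gs"):
    """Build the Ghostscript call that renders each PDF page to a PNG file."""
    out_pattern = str(Path(out_dir) / "page-%03d.png")
    return [
        gs_binary,
        "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",             # 24-bit colour PNG output device
        f"-r{dpi}",                    # resolution; higher values help OCR
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def tesseract_command(image_path, lang="eng"):
    """Build the Tesseract call that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def pdf_to_text(pdf_path, out_dir, dpi=300, lang="eng"):
    """Render the PDF to page images, OCR each one, return one string per page."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf_path, out_dir, dpi), check=True)
    pages = []
    for image in sorted(Path(out_dir).glob("page-*.png")):
        result = subprocess.run(
            tesseract_command(image, lang),
            check=True, capture_output=True, text=True,
        )
        pages.append(result.stdout)
    return pages
```

With both tools installed, `pdf_to_text("invoice.pdf", "pages")` (file names hypothetical) would return a list with the recognized text of each page. Python wrappers such as pdf2image, Wand or pytesseract can replace the subprocess calls if you prefer to stay inside Python.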

#Prerequisites

  • Tesseract: https://github.com/tesseract-ocr/tesseract
  • ImageMagick: https://github.com/ImageMagick/ImageMagick
  • Ghostscript: https://www.ghostscript.com/download/gsdnld.html
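A quick way to check that these tools are installed and on your `PATH` is sketched below. The executable names are assumptions that depend on platform and version: ImageMagick 7 ships `magick` while version 6 ships `convert`, and Ghostscript is `gs` on Linux/macOS but `gswin64c` or `gswin32c` on Windows. The `find_tools` helper is hypothetical.

```python
import shutil

# Candidate executable names per tool (assumed; they vary by platform/version).
# Note: on Windows, "convert" may resolve to an unrelated system utility.
CANDIDATES = {
    "tesseract": ["tesseract"],
    "imagemagick": ["magick", "convert"],
    "ghostscript": ["gs", "gswin64c", "gswin32c"],
}


def find_tools(which=shutil.which):
    """Map each tool to the path of the first candidate found on PATH, or None."""
    found = {}
    for tool, names in CANDIDATES.items():
        found[tool] = next((p for p in map(which, names) if p), None)
    return found


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```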

#Bibliography

  • https://hypi.io/2019/10/29/reading-text-from-invoice-images-with-python/
  • invoice2data, a Python library: https://medium.com/version-1/my-experience-extracting-invoice-data-using-invoice2data-in-python-1c6450fa001f
  • https://datascience.stackexchange.com/questions/33231/using-python-and-machine-learning-to-extract-information-from-an-invoice-inital
  • using pdf2image and EasyOCR: https://www.youtube.com/watch?v=bcmEMcEzV9M
  • some insight into my solution: https://datascience.stackexchange.com/questions/33231/using-python-and-machine-learning-to-extract-information-from-an-invoice-inital
  • create rectangles in the PDF file using ocrmypdf: https://www.youtube.com/watch?app=desktop&v=glJi3LBgn9U
  • a similar project in C#: https://github.com/robela/OCR-Invoice
  • Veryfi, an excellent API to get information from an invoice or receipt: https://www.linkedin.com/pulse/extract-data-from-receipt-invoice-5-lines-ofcode-dmitry-birulia/?articleId=6676262454793764864

#More resources

  • projects: https://aihubprojects.com/handwriting-recognition-using-cnn-ai-projects/
  • https://nanonets.com/blog/handwritten-character-recognition/

#More on Tesseract

  • https://learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/
  • https://learnopencv.com/category/text-recognition/

#Datasets

  • https://www.kaggle.com/dromosys/handwriting-recognition-cnn
  • https://www.kaggle.com/tejasreddy/iam-handwriting-top50