Batch-convert PDF files to text and extract data from them in Python

Extract Data From PDF In Python

MIT License

In this project, we are going to batch-convert PDF files to text and extract data from them without using PyPDF2/4.

We're going to achieve that by:

  • Using the pdftotext converter from Xpdf to convert PDF files to text files
  • Using regular expressions to extract data
  • Performing data cleaning using pandas
  • Exporting to an Excel file
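
As an illustration of the regular-expression step, here is a minimal sketch. The payslip layout, the field names (`pay_date`, `gross_pay`, `net_pay`), and the patterns are all hypothetical; the real patterns depend on the text pdftotext produces for your files.

```python
import re

# Hypothetical patterns: adjust them to the text pdftotext produces for your files.
PATTERNS = {
    "pay_date": re.compile(r"Pay Date:\s*(\d{2}/\d{2}/\d{4})"),
    "gross_pay": re.compile(r"Gross Pay:\s*\$?([\d,]+\.\d{2})"),
    "net_pay": re.compile(r"Net Pay:\s*\$?([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return a dict with one entry per pattern; None when a field is not found."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else None
    return row


# Made-up sample text standing in for one converted payslip.
sample = """Pay Date: 15/03/2019
Gross Pay: $1,234.50
Net Pay: $987.65
"""
row = extract_fields(sample)
print(row)
```

A list of such dicts, one per text file, can be passed to `pandas.DataFrame` for cleaning and then written out with `DataFrame.to_excel`.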

Why Not Use PyPDF2/4

Short Answer: I got this error:

TypeError: object of type 'IndirectObject' has no len()

Long Answer: If PyPDF4 had worked, I would never have had the chance to explore other approaches. I searched Stack Overflow but couldn't find a solution for this error. Surely someone else has run into the same problem, but I found no posted solution.

I was not willing to manually copy and paste the information from 52 of my payslips. Isn't that what programs are used for?

Table of Contents

  • Packages
  • Converting PDF To Text
  • Script Link

Packages

Converting PDF To Text

Converting PDF to text using Xpdf's pdftotext is really simple.

The command-line tool converts one PDF per call, so running it once per file lets us batch-convert PDFs to text files.

pdftotext source.pdf dest.txt
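
To run that command over a whole folder from Python, one option is a small wrapper around `subprocess`. This is a sketch and not necessarily how the linked script does it: the function names and folder layout are assumptions, and it expects `pdftotext` to be available on the PATH.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Build the `pdftotext source.pdf dest.txt` command for one file."""
    dest = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    return ["pdftotext", str(pdf_path), str(dest)]


def batch_convert(src_dir, out_dir):
    """Convert every .pdf file in src_dir to a .txt file in out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(src_dir).glob("*.pdf")):
        # check=True raises CalledProcessError if pdftotext exits with an error
        subprocess.run(build_command(pdf_path, out_dir), check=True)
```

For example, `batch_convert("payslips", "text")` would write one `.txt` file into `text` for each PDF found in `payslips` (both folder names are placeholders).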

Script Link

Script Link: parse_payslips.py