shine-jayakumar/Extract-Data-From-PDF-In-Python: Batch-convert pdf to text, extract data from pdf i...

Extract Data From PDF In Python

MIT License

In this project, we are going to batch-convert PDF files to text and extract data from them without using PyPDF2/4.

We're going to achieve that by:

Using the pdftotext converter from Xpdf to convert PDF files to text files
Using regular expressions to extract data
Performing data cleaning using pandas
Exporting to an Excel file
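The last three steps can be sketched roughly as follows. This is only an illustration, not the code from parse_payslips.py: the field labels ("Pay Date", "Net Pay"), the patterns, and the function names are assumptions made up for the example, and a real payslip layout will need its own patterns. Writing .xlsx files with pandas also assumes openpyxl is installed.

```python
import re

# Hypothetical field patterns; adjust them to the layout of your own text files.
PATTERNS = {
    "pay_date": re.compile(r"Pay Date:\s*(\d{2}/\d{2}/\d{4})"),
    "net_pay": re.compile(r"Net Pay:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return one row (dict) of raw string values found in a converted text file."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        # Missing fields are kept as None so they show up as gaps later.
        row[name] = match.group(1) if match else None
    return row


def clean_and_export(rows, xlsx_path):
    """Clean the extracted rows with pandas and write them to an Excel file."""
    import pandas as pd  # imported here so extract_fields works without pandas

    df = pd.DataFrame(rows)
    # "1,234.56" -> 1234.56; values that cannot be parsed become NaN
    df["net_pay"] = pd.to_numeric(
        df["net_pay"].str.replace(",", "", regex=False), errors="coerce"
    )
    # Assumes day/month/year dates, as in the hypothetical pattern above
    df["pay_date"] = pd.to_datetime(df["pay_date"], format="%d/%m/%Y")
    df = df.sort_values("pay_date")
    df.to_excel(xlsx_path, index=False)
    return df
```

For example, calling extract_fields on the contents of each converted text file gives a list of rows, which clean_and_export(rows, "payslips.xlsx") then turns into a spreadsheet (the output file name is also just a placeholder).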

Why Not Use PyPDF2/4

Short Answer: I got this error:

TypeError: object of type 'IndirectObject' has no len()

Long Answer: If PyPDF4 had worked, I would never have had a chance to explore other ways. I searched Stack Overflow but couldn't find a solution for this error. Surely someone else has run into the same problem, but I found no posted solution.

I was not willing to manually copy and paste the information from 52 of my payslips. Isn't that what programs are used for?

Table of Contents

Packages
Converting PDF To Text
Script Link

Packages

Pandas

Check out the requirements.txt

Converting PDF To Text

Converting PDF to text using Xpdf's pdftotext is really simple.

This command-line tool converts one file per call, so to batch-convert PDFs to text files we run it once for each PDF:

pdftotext source.pdf dest.txt
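One way to run that command over a whole folder is a small Python loop around subprocess. This is a minimal sketch, assuming pdftotext is on the PATH; the function names and the runner parameter are invented for the example and are not taken from parse_payslips.py.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Build the 'pdftotext source.pdf dest.txt' command for one file."""
    pdf_path = Path(pdf_path)
    txt_path = Path(out_dir) / (pdf_path.stem + ".txt")
    return ["pdftotext", str(pdf_path), str(txt_path)]


def batch_convert(src_dir, out_dir, runner=subprocess.run):
    """Convert every *.pdf in src_dir and return the list of text file paths.

    runner is a hypothetical hook (defaults to subprocess.run) that makes it
    possible to try the loop without pdftotext installed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        cmd = build_command(pdf, out)
        # check=True raises CalledProcessError if pdftotext exits with an error
        runner(cmd, check=True)
        converted.append(Path(cmd[-1]))
    return converted
```

A call such as batch_convert("payslips", "payslips_txt") would then write one .txt file per PDF; both folder names are placeholders.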

Script Link

Script Link: parse_payslips.py
