
Python API for PDF documents

=========
pdfreader
=========

:Info: See the `tutorials & documentation <https://pdfreader.readthedocs.io>`_ for more information.
:Author & Maintainer: Maksym Polshcha

See `GitHub <https://github.com/maxpmaxp/pdfreader>`_ for the latest source.

About

pdfreader is a Pythonic API for:

  • extracting texts, images and other data from PDF documents (plain or protected)
  • accessing different objects within PDF documents

pdfreader is NOT a tool (though one day it may become one!) for:

  • creating or updating PDF files
  • splitting PDF files into pages or other pieces
  • converting PDFs to any other format

Nevertheless, it can be used as part of such tools.

See `Tutorials & Documentation <https://pdfreader.readthedocs.io>`_.

Features

  • Extracts texts (plain text and formatted text objects)
  • Extracts PDF form data (plain strings and formatted text objects)
  • Supports all PDF encodings, CMap and predefined CMaps
  • Extracts images and image masks as `Pillow/PIL Images <https://pillow.readthedocs.io/en/stable/reference/Image.html>`_
  • Supports encrypted and password-protected PDF documents
  • Lets you browse any document objects and resources and extract any data you need (fonts, annotations, metadata, multimedia, etc.)
  • Follows the `PDF-1.7 specification <https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf>`_
  • Lazy object access makes it possible to process huge PDF documents quickly
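As a rough illustration of the text-extraction feature listed above, the sketch below collects the plain strings of every page. It assumes that ``SimplePDFViewer``, its ``password`` keyword and the ``canvas.strings`` attribute behave as described in the project's tutorial; ``page_strings`` and ``extract_text`` are hypothetical helper names and not part of pdfreader. Check the documentation before relying on it.

```python
from typing import Iterator, List


def page_strings(viewer) -> Iterator[List[str]]:
    """Yield the list of plain strings found on each page, in page order.

    Assumption: `viewer` behaves like pdfreader's SimplePDFViewer, i.e.
    iterating over it yields one rendered canvas per page, and each canvas
    exposes the page's plain strings as `canvas.strings`.
    """
    for canvas in viewer:
        yield list(canvas.strings)


def extract_text(path: str, password: str = "") -> List[str]:
    """Return one joined string per page of the PDF stored at `path`."""
    # Imported lazily so that page_strings() stays usable without pdfreader.
    from pdfreader import SimplePDFViewer

    with open(path, "rb") as fd:
        # Assumption: the viewer accepts the document password as a keyword.
        viewer = SimplePDFViewer(fd, password=password)
        return ["".join(strings) for strings in page_strings(viewer)]
```

Calling ``extract_text("example.pdf")``, where the file name is a placeholder, would return a list with one string per page.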

Installation

pdfreader can be installed with `pip <http://pypi.python.org/pypi/pip>`_::

  $ python -m pip install pdfreader

Or ``easy_install`` from `setuptools <http://pypi.python.org/pypi/setuptools>`_::

  $ python -m easy_install pdfreader

You can also download the project source and do::

  $ python setup.py install
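
To check that the installation worked, one option is to ask the standard library which version is installed. This is a minimal sketch: ``installed_version`` is a hypothetical helper, not part of pdfreader, and it assumes Python 3.8 or newer for ``importlib.metadata``.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "pdfreader") -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version is not None else "pdfreader is not installed")
```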

Tutorial and Documentation

`Tutorial, real-life examples and documentation <https://pdfreader.readthedocs.io>`_

Support, Bugs & Feature Requests

pdfreader uses `GitHub issues <https://github.com/maxpmaxp/pdfreader/issues>`_ to keep track of bugs, feature requests, etc.

Related Projects

  • `pdfminer <https://github.com/euske/pdfminer>`_
  • `PyPDF2 <https://github.com/py-pdf/PyPDF2>`_
  • `xpdf <http://www.foolabs.com/xpdf/>`_
  • `pdfbox <http://pdfbox.apache.org/>`_
  • `mupdf <http://mupdf.com/>`_

References

  • `Document management - Portable document format - PDF 1.7 <https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf>`_
  • `Adobe CMap and CIDFont Files Specification <https://www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf>`_
  • `PostScript Language Reference Manual <https://www-cdf.fnal.gov/offline/PostScript/PLRM2.pdf>`_
  • `Adobe CMap resources <https://github.com/adobe-type-tools/cmap-resources>`_
  • `Adobe glyph list specification (AGL) <https://github.com/adobe-type-tools/agl-specification>`_

Donation

If this project is helpful, you can treat me to coffee :-)

.. image:: https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif
    :target: https://www.paypal.com/cgi-bin/webscr?cmd=_donations&business=VMVFZSDHDFVK6&item_name=PDFReader+support&currency_code=USD&source=url