
Refactor pdf2txt.py into an entry point

Open lrq3000 opened this issue 6 years ago • 1 comments

It is impossible to use pdf2txt.py, the package's main PDF-to-text script, from a Python function without spawning a shell. This adds unnecessary overhead and introduces encoding issues.

This could all be easily avoided by refactoring pdf2txt.py into an entry point, as described here and here.

lrq3000 avatar Jan 01 '19 18:01 lrq3000

This would have to look something like this:

entry_points = {'console_scripts': [
    'pdf2txt=pdfminer.pdf2txt:main',
    ...
]},

https://github.com/pdfminer/pdfminer.six/issues/27#issuecomment-274097265
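For an entry point like this to work, the target module needs a `main` callable that can run without arguments (the generated console script calls it that way) and can also be called directly from Python with an explicit argument list. A minimal sketch of that shape follows, using only the standard library. The option names and the placeholder body are assumptions for illustration, not the actual pdf2txt.py implementation:

```python
import argparse
import sys
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, reduced set of options; the real script has many more.
    parser = argparse.ArgumentParser(prog="pdf2txt")
    parser.add_argument("files", nargs="+", help="PDF files to convert")
    parser.add_argument("-o", "--outfile", default="-",
                        help="output path, '-' for stdout")
    return parser


def main(args: Optional[List[str]] = None) -> int:
    # args=None makes argparse fall back to sys.argv[1:], which is what
    # happens when the generated console script calls main() with no arguments.
    parsed = build_parser().parse_args(args)
    for path in parsed.files:
        # Placeholder for the actual conversion logic.
        print(f"would convert {path} -> {parsed.outfile}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

With this layout the same function serves both callers: the console script invokes `main()`, while Python code can call `main(["example.pdf", "-o", "out.txt"])` without going through a shell.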

pietermarsman avatar Jul 14 '19 14:07 pietermarsman

I realized that this is not as easy as I thought. The console script needs to be in the pdfminer package, but it is not. It is in the tools/ directory.

Maybe we should move the contents of tools/ into the package as well and keep just some aliases there, so that nothing breaks?

pietermarsman avatar Aug 14 '22 10:08 pietermarsman

I also realize now that this question already has a perfectly valid solution. To use pdfminer from a Python function (or anywhere else), you can use the high-level API:

from pdfminer.high_level import extract_text

text = extract_text("example.pdf")
print(text)

All the features of the command-line tool are also available through extract_text and related functions.
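As a rough illustration of how command-line options carry over, the sketch below passes a few of them as keyword arguments to `extract_text`. The wrapper name `pdf_to_text` is hypothetical; `page_numbers`, `maxpages` and `password` are parameters of `extract_text` as I understand its documented signature, so check them against the version you have installed:

```python
from typing import Iterable, Optional


def pdf_to_text(path: str,
                page_numbers: Optional[Iterable[int]] = None,
                maxpages: int = 0,
                password: str = "") -> str:
    """Hypothetical wrapper mirroring a few pdf2txt.py options."""
    # Imported here so that merely defining this helper does not require
    # pdfminer.six to be installed; a top-level import works just as well.
    from pdfminer.high_level import extract_text

    return extract_text(
        path,
        password=password,
        # Zero-indexed page numbers; None means all pages.
        page_numbers=list(page_numbers) if page_numbers is not None else None,
        # 0 means no limit on the number of pages.
        maxpages=maxpages,
    )


if __name__ == "__main__":
    # Extract only the first two pages of a local file.
    print(pdf_to_text("example.pdf", page_numbers=[0, 1]))
```

For writing to a file object or producing HTML or XML output, `extract_text_to_fp` in the same module is the closer match to what the script does.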

pietermarsman avatar Aug 14 '22 10:08 pietermarsman

The request to add pdf2txt as an entry point is also made in issue #724, which is slightly more on-point, so I'm keeping that one open to track the request.

pietermarsman avatar Aug 14 '22 10:08 pietermarsman