Refactor pdf2txt.py into an entry point
pdf2txt.py, the package's main PDF-to-text script, cannot be used from a Python function without spawning a shell. That adds unnecessary overhead and can cause encoding issues.
This could all be avoided by refactoring pdf2txt.py into an entry point, as described here and here.
This would have to look something like this:
entry_points = {'console_scripts': [
    'pdf2txt=pdfminer.pdf2txt:main',
    ...
]},
https://github.com/pdfminer/pdfminer.six/issues/27#issuecomment-274097265
I realized that this is not as easy as I thought. The console script needs to be in the pdfminer package, but it is not. It is in the tools/ directory.
Maybe we should move the content of tools/ into the package as well and keep only aliases there, so that nothing breaks?
I also realize now that this question already has a perfectly valid solution. To use pdfminer from a Python function (or anywhere), you can use the high-level API:
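For illustration, here is a minimal sketch of what a `main` function inside the package could look like so that it works both as a console script and as a plain Python call. The argument names and the placeholder conversion step are assumptions made for this example, not the actual pdf2txt.py code.

```python
import argparse


def build_parser():
    # Hypothetical, reduced argument set; the real script has many more options.
    parser = argparse.ArgumentParser(prog="pdf2txt")
    parser.add_argument("files", nargs="+", help="PDF files to convert")
    parser.add_argument("-o", "--outfile", default="-",
                        help="output path, '-' for stdout")
    return parser


def main(argv=None):
    """Console-script target.

    The generated console script calls main() without arguments, in which
    case argparse falls back to sys.argv[1:]. Python callers can pass an
    explicit list of arguments instead of going through a shell.
    """
    args = build_parser().parse_args(argv)
    # Placeholder for the real conversion: only report what would be done.
    for path in args.files:
        print(f"would convert {path} -> {args.outfile}")
    return 0
```

With this shape, a script kept in tools/ for backwards compatibility would only need to import `main` from the package and call it, and Python code could call `main(["example.pdf", "-o", "out.txt"])` directly.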
from pdfminer.high_level import extract_text
text = extract_text("example.pdf")
print(text)
All the features of the command-line tool are also available through extract_text and related functions.
The request to add pdf2txt as an entry point is also made in issue #724, which is slightly more on-point, so I'm keeping that one open to track the request.
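As a rough sketch of how command-line style options carry over, the helper below passes a page selection, a password, and layout parameters to extract_text. The wrapper name `extract_pages` and the chosen `line_margin` value are assumptions for this example; the keyword arguments are those of `pdfminer.high_level.extract_text`.

```python
def extract_pages(path, pages=None, password=""):
    """Extract text from selected pages of a PDF (hypothetical helper).

    `pages` is a list of zero-indexed page numbers, or None for all pages.
    """
    # Imported lazily so this module still loads where pdfminer.six is absent.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # Layout analysis settings; 0.5 is an illustrative value.
    laparams = LAParams(line_margin=0.5)
    return extract_text(
        path,
        password=password,
        page_numbers=pages,
        laparams=laparams,
    )
```

For example, `extract_pages("example.pdf", pages=[0, 1])` would return the text of the first two pages as a single string.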