pdfminer.six
Community maintained fork of pdfminer - we fathom PDF
**Bug report** **- A description of the bug** Bad HTML markup generated while using `pdf2txt.py test.pdf -t html -o test.html` **- Steps to reproduce the bug.** 1. Use the following...
In a call to `get_pages`, this PDF raised an exception. pdfminer version: refs/tags/20201018 PDF: https://source.android.com/compatibility/5.1/android-5.1-cdd.pdf My code looks like this: ```python raw_input = io.BytesIO(content) # The file contents html_output =...
Hi, I am not able to find any combination of LAParams that correctly converts the attached simple PDF to text. In the resulting text, the lines are not in the correct order: Expected...
Hi, I've got this PDF (see attachment) which opens just fine in a PDF viewer but fails to get parsed: ``` PDFSyntaxError Traceback (most recent call last) in () 7...
I have some PDF documents that look like: ``` %PDF %objects xref %table trailer
**Bug report** Copy of #471 (by @imochoa) Sadly, I cannot upload the problematic PDFs due to a non-disclosure agreement. I can however point out the issue and share my fix. When...
### Description In my PDF I have some math formulas. I encounter no problem when reading the file with `pdfminer`, but the position of the math is wrong. Because of...
hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR).
Currently, we have a couple of PDFs as test cases. However, a lot of bug reports come with problematic PDFs. It would be great if we could add regression tests...
The current distance function computes the area between two textboxes. This can prioritize the grouping of textboxes A and B, while C is in between A and B. This is...
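The grouping problem described above can be reproduced with a small sketch. It assumes the distance is the area of the joint bounding box minus the areas of the two boxes, which is one reading of "the area between two textboxes". The `Box` class, the `area_distance` function, and the coordinates are hypothetical illustrations, not pdfminer.six's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned text box (hypothetical stand-in, not a pdfminer class)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def area_distance(a: Box, b: Box) -> float:
    """Area of the joint bounding box that neither box covers.

    Assumes the two boxes do not overlap each other.
    """
    x0, y0 = min(a.x0, b.x0), min(a.y0, b.y0)
    x1, y1 = max(a.x1, b.x1), max(a.y1, b.y1)
    return (x1 - x0) * (y1 - y0) - a.area - b.area


# A sits above B; C is a wide, thin box lying vertically between them.
A = Box(0, 20, 100, 30)
B = Box(0, 0, 100, 10)
C = Box(0, 12, 300, 18)

d_ab = area_distance(A, B)
d_ac = area_distance(A, C)
d_bc = area_distance(B, C)

# A and B score as the closest pair even though C separates them.
print(d_ab, d_ac, d_bc)
```

Under this measure, a greedy grouping that merges the closest pair first would join A and B before either is joined with C, because the width of C inflates every bounding box that contains it.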