pdfminer.six issues

Bad HTML markup generated

2

**Bug report**

**- A description of the bug**
Bad HTML markup generated while using `pdf2txt.py test.pdf -t html -o test.html`

**- Steps to reproduce the bug.**
1. Use the following...

andrei-volkau

type: bug

component: converter

status: needs solution

pdfminer.psparser.PSSyntaxError: Invalid dictionary construct

6

In a call to `get_pages`, this PDF raised an exception.

pdfminer version: refs/tags/20201018
PDF: https://source.android.com/compatibility/5.1/android-5.1-cdd.pdf

My code looks like this:

```python
raw_input = io.BytesIO(content)  # The file contents
html_output =...
```

markmcd

type:anomaly

status: needs solution

extract_text mixes lines

7

Hi, I am not able to find any combination of LAParams that correctly converts the attached simple PDF to text. In the resulting text, the lines are not in the correct sequence: Expected...

Ev2geny

type: question

status: needs solution

No /Root object! - Is this really a PDF?

5

Hi, I've got this PDF (see attachment) which opens just fine in a PDF viewer but fails to get parsed: ``` PDFSyntaxError Traceback (most recent call last) in () 7...

micmalti

type:anomaly

status: needs solution

Handle XREFs with missing startxref after trailer

1

I have some PDF documents that look like:

```
%PDF
%objects
xref
%table
trailer
```

bmteller

type:anomaly

status: needs solution
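
The layout described in this report (an `xref` table and a `trailer` with no `startxref` pointer after it) suggests a fallback search. The following is a minimal sketch of that idea, not pdfminer.six's actual implementation: `find_xref_offset` is a hypothetical helper that prefers the last `startxref` value and otherwise falls back to the byte position of the last standalone `xref` keyword.

```python
import re
from typing import Optional


def find_xref_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the cross-reference table, or None.

    Hypothetical helper for illustration only.
    """
    # Preferred path: use the offset announced by the last `startxref`.
    announced = None
    for announced in re.finditer(rb"startxref\s+(\d+)", data):
        pass
    if announced is not None:
        return int(announced.group(1))

    # Fallback: no `startxref`, so look for the last standalone `xref`
    # keyword. The lookbehind skips the "xref" inside "startxref".
    keyword = None
    for keyword in re.finditer(rb"(?<!start)xref\b", data):
        pass
    return keyword.start() if keyword is not None else None


# A tiny stand-in document with a table and trailer but no `startxref`.
truncated = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<<>>\nendobj\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\n"
)
fallback_offset = find_xref_offset(truncated)
```

A real parser would still need to validate the table it finds at that offset; the sketch only shows where to start looking.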

PDFObjRef is not iterable

**Bug report** Copy of #471 (by @imochoa) Sadly, I cannot upload the problematic PDFs due to a non-disclosure agreement. I can however point out the issue and share my fix. When...

pietermarsman

type: bug

status: needs solution

Math formula position are detected wrongly

5

### Description
In my PDF I have some math formulas. I encounter no problem when reading the file with `pdfminer`, but the position of the math is wrong. Because of...

astariul

type: bug

component: converter

status: needs solution

Add hOCR output type for pdf2txt

5

hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR).

hason

type: new feature

status: needs solution

Use git lfs for storing sample pdfs

2

Currently, we have a couple of PDFs as test cases. However, a lot of bug reports come with problematic PDFs. It would be great if we could add regression tests...

pietermarsman

type: development

status: needs solution

Improve distance function for textboxes

1

The current distance function computes the area between two textboxes. This can prioritize the grouping of textboxes A and B, while C is in between A and B. This is...

pietermarsman

type: new feature

status: needs solution
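
The grouping problem described in the textbox distance report can be reproduced with a few numbers. This is a minimal sketch, assuming "the area between two textboxes" means the area of the joint bounding box minus the areas of the two boxes; that reading, the `Box` type, and `area_distance` are illustrative assumptions, not code taken from pdfminer.six.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle given by its lower-left and upper-right corners."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def area_distance(a: Box, b: Box) -> float:
    """Area of the joint bounding box that neither box covers (assumed metric)."""
    x0, y0 = min(a.x0, b.x0), min(a.y0, b.y0)
    x1, y1 = max(a.x1, b.x1), max(a.y1, b.y1)
    return (x1 - x0) * (y1 - y0) - a.area - b.area


# A and B are small boxes on the same line; C is a tall, narrow box
# sitting horizontally between them.
A = Box(0, 0, 10, 10)
B = Box(20, 0, 30, 10)
C = Box(12, -50, 18, 60)

d_ab = area_distance(A, B)  # 100
d_ac = area_distance(A, C)  # 1220
d_bc = area_distance(B, C)  # 1220
```

Under this metric A and B are the closest pair, so they would be grouped first even though C lies between them, which is the behaviour the report describes.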
