pdfminer.six issues

Text Extraction: first character of LTTextLine totally disappears

1

Hi, I am trying to extract several text blocks (using pdfquery https://github.com/jcushman/pdfquery but it's mostly dependent on the pdfminer backend). Most of the extractions work well but sometimes the first character...

NicoLivesey

type: bug

component:document

status: needs more info

encodingdb.name2unicode(name: str) -> str can't handle type1 font diff like: 2, /'MT110', /'MT50',...

1

The current version of encodingdb.name2unicode(name: str) -> str can't handle type1 font diff like: 2, /'MT110', /'MT50',... It'll decode the diff as cid3, cid 4., ... Compared with a previous...

dzhang228

type: bug

component:characters

status: needs more info

Debugging is slowing down our processing

2

Hello guys, I recently integrated camelot to convert my PDF files to dataframes, with a FastAPI upload process. Currently the processing takes 3 minutes per file. After digging deeper...

professorr-x

type:performance

status: needs more info
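The snippet above is truncated, so the actual cause is not shown. A minimal sketch, assuming the slowdown comes from DEBUG-level log records emitted by loggers in the `pdfminer` namespace once the application enables debug logging globally. The helper name `quiet_pdfminer_logging` is hypothetical and only uses the standard `logging` module:

```python
import logging


def quiet_pdfminer_logging(level: int = logging.WARNING) -> logging.Logger:
    """Raise the threshold of the 'pdfminer' logger (hypothetical helper).

    Assumption: the library's modules log under names that start with
    'pdfminer.', so they inherit this level through the logger hierarchy.
    """
    logger = logging.getLogger("pdfminer")
    logger.setLevel(level)
    return logger


# The application turns on DEBUG globally, e.g. while developing an upload endpoint.
logging.basicConfig(level=logging.DEBUG)

# Silence only the 'pdfminer' namespace; application loggers stay at DEBUG.
quiet_pdfminer_logging()
```

Setting the level on the parent logger rather than on the root keeps debug output for the rest of the application while dropping the per-token records before they are formatted.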

Order of the text is mixed up and finding them in wrong places:

7

Order of the text is mixed up and finding them in wrong places: **I'm using the following code:** ``` output_string = StringIO() with open('/Users/udayallu/similarity_search_training/Pol_ProcHdbk1_23.pdf', 'rb') as in_file: parser = PDFParser(in_file)...

uday-allu

component: converter

status: needs more info

🐛 LTChar direct child of LTPage : 'LTChar' object is not iterable

4

## File for reproducing the bug [2.pdf](https://github.com/pdfminer/pdfminer.six/files/5399532/2.pdf) ## Description When running the following code from the [official documentation](https://pdfminersix.readthedocs.io/en/latest/tutorial/extract_pages.html) on the linked file : ```python from pdfminer.high_level import extract_pages from pdfminer.layout...

astariul

type: bug

component: converter

status: needs more info
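The error in the title suggests that nested `for` loops assumed every child of a page is an iterable container. A minimal sketch of a more defensive traversal, assuming containers are iterable and characters are not. `iter_leaves`, `FakeChar`, and `FakeContainer` are hypothetical stand-ins for illustration and are not part of pdfminer.six:

```python
from collections.abc import Iterable, Iterator
from typing import Any


def iter_leaves(element: Any) -> Iterator[Any]:
    """Yield leaf objects, descending only into iterable containers."""
    if isinstance(element, Iterable) and not isinstance(element, (str, bytes)):
        for child in element:
            yield from iter_leaves(child)
    else:
        # Non-iterable objects (e.g. a character placed directly on a page)
        # are yielded as-is instead of being looped over.
        yield element


class FakeChar:
    """Stand-in for a non-iterable character object."""

    def __init__(self, text: str) -> None:
        self.text = text


class FakeContainer:
    """Stand-in for an iterable layout container (page, box, line)."""

    def __init__(self, *children: Any) -> None:
        self.children = list(children)

    def __iter__(self) -> Iterator[Any]:
        return iter(self.children)


# A page holding one character directly and two more inside box > line.
page = FakeContainer(
    FakeChar("A"),
    FakeContainer(FakeContainer(FakeChar("B"), FakeChar("C"))),
)
chars = [leaf.text for leaf in iter_leaves(page)]
```

The recursion does not depend on a fixed page > box > line > character depth, so a character that appears at an unexpected level is still reached.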

Crash in PDFSimpleFont.init (& monkey patch workaround)

2

**Bug report** I'm seeing a crash in the latest release of pdfminer.six (20200726) with certain PDF files. Unfortunately for privacy reasons I can't share these. The crash is caused because...

eoinof

type:anomaly

status: needs more info

AttributeError: 'PDFParser' object has no attribute 'seek'

3

**Bug report** Environment: window64--Python 3.6 + Spyder 3.2.8 + pdfminer.six-20200726 ======code============ ```python import pdfminer from pdfminer.layout import LAParams from pdfminer.converter import PDFPageAggregator from pdfminer.pdfpage import PDFPage from pdfminer.layout import LTTextBoxHorizontal...

Ocruc28

type:anomaly

status: needs more info

pdf2txt.py cannot be found in windows anaconda environment

2

- A description of the bug once you install pdfminer.six in anaconda, you cannot run pdf2txt.py - Steps to reproduce the bug. Try to minimize the number of steps needed....

sagniknitr

type: question

status: needs more info

KeyError: 'Type' (in stream when opening some pdf)

4

**Bug report** _When loading a pdf file:_ The **Type** key is not in the **stream** dictionary, which raises a KeyError. The pdf file I used is [here](https://www.ema.europa.eu/en/documents/product-information/cerdelga-epar-product-information_fr.pdf) Environment: macOS11.0.1 --Python...

BaptCha

type:anomaly

status: needs solution

PSEOF: Unexpected EOF when ATTACHMENTS are present in a pdf

9

This problem occurs when there are **ATTACHMENTS** present within a pdf file. I have provided a sample file in the below link: [attachment_test.pdf](https://github.com/pdfminer/pdfminer.six/files/5507157/attachment_test.pdf) Screenshot of an example file: ![image](https://user-images.githubusercontent.com/35597446/98484003-816c4e00-2232-11eb-9221-fb8ff64e76a2.png) _Originally...

SWARUP-Selvaraj

type:anomaly

component:parser

status: needs solution
