pypdf issues

"extract_text" doesn't output the same transformation matrix in version 3.17 as in 3.16.

9

I'm trying to extract text from a pdf together with the position of the text. When I do it in pypdf 3.16 I get the expected result, but I don't...

ghbm-itk

workflow-advanced-text-extraction

Execute button in a pdf form

Hello, I am exploring how to populate a PDF form using pypdf. The PDF form I am working on is the following one: https://www.uspto.gov/sites/default/files/patents/process/file/efs/guidance/updated_IDS.pdf It is used for...

slimbeji-pb

workflow-forms

Spaces (that do not exist in the original PDF) appear in the output of extract_text()

4

I am trying to parse [this PDF](https://www.joinville.sc.gov.br/wp-content/uploads/2023/11/Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf). However, the output of extract_text() contains a bunch of spaces that are not in the original PDF. See the screenshot...

renanbirck

is-bug

workflow-text-extraction

Has MCVE

help wanted

whitespace

MAINT: Simplify file identifiers generation

5

exiledkingcc

on-hold

extract_text() return garbled characters

6

I get garbled characters when parsing a PDF file. The file I use is [this](http://www.aas.net.cn/fileZDHXB/journal/article/zdhxb/2012/8/PDF/20120812.pdf). Could there be an encoding issue? ## Environment ```bash $ python -m platform Linux-4.18.0-147.5.1.6.h841.eulerosv2r9.x86_64-x86_64-with-glibc2.17 $ python -c...

ChanghaoLau

workflow-text-extraction

Has MCVE

ENH: Merge improvement

2

Proposal to complete #2203.

pubpub-zz

Sel fontinfields

6

Add capability to change font and size. Closes #2253.

pubpub-zz

help wanted

STY: Same attributes between PdfReader and PdfWriter

4

Provides the same interface to access root, info, and id in both classes. The objective is to prepare some code factorization between PdfWriter and PdfReader.

pubpub-zz

'not enough image data' exception from PIL

I am trying to extract images from PDF files, but PIL occasionally raises a 'not enough image data' exception when handling certain PDFs. The files look correct in Atril...

brianpow

is-bug

workflow-images

Has MCVE

Microsoft Word table of contents Link annotation error.

I am trying to use PdfReader and PdfWriter to read and write annotations in a PDF file. I use a PDF file produced by Microsoft Word -> Save As PDF. The Word file has 3...

vokson
