PyMuPDF Linebreak inserted between each letter

Open rezemika opened this issue 1 year ago • 1 comments

Description of the bug

Hey, thank you so much for this amazing tool!

I am using PyMuPDF to parse many official French documents. Each one contains a cover, a table of contents, and pages of scanned content. The vast majority are read without any problem, but for a small number of them a line break is inserted between each letter of the extracted text, making it almost unreadable.

Here are links to a few documents where this happens:

How to reproduce the bug

For instance, here is an example with the second mentioned document:

>>> import pymupdf
>>> f = "2023-04-28-ee04e9ccb016e7806a7cf92a48155834.pdf"
>>> doc = pymupdf.Document(f)
>>> doc[0].get_text("blocks")
[
    (164.6999969482422, 377.63739013671875, 436.3139953613281, 394.6753845214844, 'R\nE\nC\nU\nE\nI\nL\n \nD\nE\nS\n \nA\nC\nT\nE\nS\n \nA\nD\nMI\nN\nI\nS\nT\nR\nA\nT\nI\nF\nS\n', 0, 0),
    (225.0, 531.0374145507812, 376.00396728515625, 548.0614013671875, 'n\n°\n \n7\n7\n \nd\nu\n \n2\n8\n \na\nv\nr\ni\nl\n \n2\n0\n2\n3\n', 1, 0)
]

>>> pymupdf.version
('1.24.7', '1.24.4', '20240626000001')
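
As a stopgap while the root cause is unknown, the affected blocks could be detected and rejoined after extraction. This is only a minimal sketch and not part of PyMuPDF: `looks_letter_split` and `rejoin_letters` are hypothetical helper names, and the 0.8 threshold is an arbitrary assumption. It operates on the block text shown above (index 4 of each tuple returned by `get_text("blocks")`).

```python
def looks_letter_split(text, threshold=0.8):
    """Heuristic (assumption): a block is 'letter-split' when most of its
    newline-separated pieces are at most one character long."""
    pieces = text.split("\n")
    # A trailing newline produces one empty final piece; ignore it.
    if pieces and pieces[-1] == "":
        pieces.pop()
    if len(pieces) < 3:
        return False
    short = sum(1 for piece in pieces if len(piece) <= 1)
    return short / len(pieces) >= threshold


def rejoin_letters(text):
    """Remove the spurious line breaks; real spaces are kept as they are."""
    return text.replace("\n", "")


# Block texts copied from the output above.
cover_title = ('R\nE\nC\nU\nE\nI\nL\n \nD\nE\nS\n \nA\nC\nT\nE\nS\n \n'
               'A\nD\nMI\nN\nI\nS\nT\nR\nA\nT\nI\nF\nS\n')
cover_issue = 'n\n°\n \n7\n7\n \nd\nu\n \n2\n8\n \na\nv\nr\ni\nl\n \n2\n0\n2\n3\n'

for block_text in (cover_title, cover_issue):
    if looks_letter_split(block_text):
        print(rejoin_letters(block_text))
```

Note that this only repairs the text; it would also flatten a genuinely multi-line block whose lines happen to be single characters, so it should be applied only to documents known to be affected.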

And here is its first page as I see it:

Cover of the second mentioned document.

Please let me know if I can provide any further information!

PS: Is there any "debugging tool" that would allow you to view text and content blocks as they're seen by PyMuPDF for easier analysis?
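
For what it's worth, the closest thing I have found so far is the nested structure returned by `page.get_text("dict")` (blocks, then lines, then spans, each with a `bbox`). Below is a rough sketch of how it could be flattened for inspection; `summarize_text_dict` is a hypothetical helper of mine, not a PyMuPDF function, and the sample dictionary is hand-written to mimic the shape of that output rather than taken from the real document.

```python
def summarize_text_dict(page_dict):
    """Return one row per text line: (block_no, line_no, bbox, span_texts)."""
    rows = []
    for block_no, block in enumerate(page_dict.get("blocks", [])):
        # Text blocks have type 0; image blocks carry no "lines" key.
        if block.get("type", 0) != 0:
            continue
        for line_no, line in enumerate(block.get("lines", [])):
            texts = [span.get("text", "") for span in line.get("spans", [])]
            rows.append((block_no, line_no, tuple(line.get("bbox", ())), texts))
    return rows


# Hand-written sample (assumption) imitating a letter-split block:
# every character sits on its own line.
sample = {
    "blocks": [
        {"type": 0, "lines": [
            {"bbox": (164.7, 377.6, 176.0, 394.7), "spans": [{"text": "R"}]},
            {"bbox": (176.0, 377.6, 186.5, 394.7), "spans": [{"text": "E"}]},
        ]},
        {"type": 1},  # an image block, skipped by the helper
    ]
}

rows = summarize_text_dict(sample)
for row in rows:
    print(row)

# With a real file (not executed here):
#   import pymupdf
#   doc = pymupdf.open("2023-04-28-ee04e9ccb016e7806a7cf92a48155834.pdf")
#   rows = summarize_text_dict(doc[0].get_text("dict"))
```

If the affected pages show one line per character, with line boxes that sit side by side on the same baseline, that would at least narrow down where the breaks come from.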

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.11

Jul 02 '24 14:07 rezemika
