PyMuPDF Pro 1.25.0: A 1-page .docx file is split into 4 pages

Open trianxy opened this issue 10 months ago • 2 comments

Description of the bug

For some documents, PyMuPDF Pro splits the document into many more pages than Google Docs, Mac Pages, or LibreOffice show for the same file.

This creates several downstream problems (example: exporting first page as png via page.get_pixmap().tobytes(output="png") won't match the expected first page).

How to reproduce the bug

Download the attached 1page-is-split-into-4pages.docx and run

import pymupdf.pro
pymupdf.pro.unlock()  # use a trial key to see output of 4th page etc.

document = pymupdf.open("1page-is-split-into-4pages.docx")
for page in document:
    print(page)
    print(page.get_text())

and observe that PyMuPDF reports 4 pages, although the same file shows as 1 page when opened in Google Docs, Mac Pages, or LibreOffice.
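
As a cross-check that does not depend on any viewer, here is a minimal sketch that reads the page count stored inside the .docx itself. It assumes the file contains the optional docProps/app.xml part with a Pages element, which the authoring application writes and which may be missing or stale. The helper name recorded_page_count is hypothetical and is not part of PyMuPDF.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the OOXML "extended properties" part (docProps/app.xml).
EXT_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"


def recorded_page_count(docx):
    """Return the <Pages> value stored in docProps/app.xml, or None if absent.

    `docx` may be a file path or a binary file-like object.
    Hypothetical helper for illustration; not a PyMuPDF API.
    """
    with zipfile.ZipFile(docx) as archive:
        try:
            app_xml = archive.read("docProps/app.xml")
        except KeyError:
            return None  # the part is optional and may not exist
    pages = ET.fromstring(app_xml).find(EXT_NS + "Pages")
    if pages is None or not (pages.text or "").strip().isdigit():
        return None
    return int(pages.text)


# Example call with the attached file (path assumed to be in the working directory):
# print(recorded_page_count("1page-is-split-into-4pages.docx"))
```

If this value is available, it can be compared with len(document) from the script above. The stored number is only what the last application to save the file recorded, so treat it as a hint and not as ground truth.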

PyMuPDF version

1.25.0

Operating system

Linux

Python version

3.9

Dec 17 '24 13:12 trianxy
